arindam

Verified account

@_arindam

Generative AI solution engineer | Building Agents

blr

Joined April 2009

222 Following

144 Followers

1.6K Posts

Pinned Tweet

11 months ago

I think the final battle will always be between Gemini and Grok. Claude will remain a protected model and target enterprises with slow, careful progress. OpenAI may still lead on active user base, since the term ChatGPT has become a synonym for any AI, although their models may stay stagnant for some time. Meta may come up with something. Time will tell.

0

0

0

0

227

20 days ago

@arpit_bhayani Yes. The CLI is like a gift for agents.

0

0

0

0

93

_arindam retweeted

Muratcan Koylan

28 days ago

Gradient descent for SKILL.md files sounds interesting, maybe a bit complex, but it's becoming a real part of the agent harness.

SkillOpt is one of the first papers to treat markdown skill files as trainable parameters and provides a proper optimization framework for them.

A few things I learned that you should consider too.

1. The validation gate is the only thing that matters in a self-editing loop.

Held-out set, strict improvement, ties rejected. End-to-end, their best skills land with 1 to 4 accepted edits total. If your "self-improving agent" is accepting most of what it proposes, you're shipping slop.
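A minimal sketch of such a gate, assuming a scoring function that evaluates a skill text on a held-out set. The names `gated_update` and `score` are hypothetical and not taken from the paper; the only rule borrowed from the text above is that a candidate must strictly improve the score, so ties are rejected.

```python
from typing import Callable, Sequence, Tuple


def gated_update(
    current: str,
    candidates: Sequence[str],
    score: Callable[[str], float],
) -> Tuple[str, int]:
    """Return the surviving skill text and the number of accepted edits.

    `score` is assumed to evaluate a skill on a held-out set (hypothetical).
    """
    best, best_score, accepted = current, score(current), 0
    for candidate in candidates:
        candidate_score = score(candidate)
        # Strict improvement only: an equal score is rejected.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            accepted += 1
    return best, accepted


# Toy held-out scores, purely illustrative.
toy_scores = {"v0": 0.50, "v1": 0.50, "v2": 0.60, "v3": 0.55}
final_skill, n_accepted = gated_update("v0", ["v1", "v2", "v3"], toy_scores.__getitem__)
```

In this toy run only one of the three proposals passes the gate, which is the kind of acceptance rate the post argues a healthy loop should show.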

2. Bounded edits are better than full rewrites. 4 to 8 edits per step is the sweet spot.

Remove the budget and performance collapses. This is the textual analog of learning rate, and it transfers to any LLM-as-author loop. If you're using an agent to refactor your docs, your prompts, or your skills, cap the diff size.
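One way to cap the diff size can be sketched with Python's standard `difflib`. How the paper counts an "edit" is not stated here, so this sketch assumes one edit is one contiguous changed region at line granularity; `count_edits`, `within_budget`, and the default budget of 8 are illustrative choices.

```python
import difflib


def count_edits(old: str, new: str) -> int:
    """Count contiguous changed regions between two texts, compared line by line."""
    matcher = difflib.SequenceMatcher(
        a=old.splitlines(), b=new.splitlines(), autojunk=False
    )
    # Every opcode that is not "equal" is a replace, insert, or delete region.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


def within_budget(old: str, new: str, max_edits: int = 8) -> bool:
    """Reject a proposed rewrite that touches more regions than the budget allows."""
    return count_edits(old, new) <= max_edits


old_skill = "a\nb\nc\nd"
new_skill = "a\nX\nc\nY"  # two separate single-line changes
```

A proposal that fails `within_budget` would simply be discarded before it ever reaches the validation gate.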

3. Compactness wins. Median final skill: ~920 tokens.

Skills do not need to be long. They need to be high-signal. Most skill files I see are bloated because length feels like effort. It isn't.

4. The harness is becoming less important; the skill is becoming more important.

A Codex-trained skill ported into Claude Code hit +59.7 points on SpreadsheetBench. Procedural knowledge is more general than the runtime that produced it.

5. Frozen model + trained context is the practical adaptation.

GPT-5.4-nano with a SkillOpt'd skill ≈ frontier behavior on procedural benchmarks. Cheaper, portable, inspectable, zero inference-time cost. This is the answer to "how do we adapt a frontier model for our domain" for almost everyone who isn't training their own models.

6. Verification is the bottleneck.

Every gate in this paper depends on an auto-grader. That works for benchmarks. It fails for writing, design, and strategy, exactly the open-ended work we want to automate. Whoever builds the verifier for open-ended tasks owns the next stage.

There are also two lessons I learned while shipping v2.3.0 of my Context Engineering Agent Skills repo, measured across composer-2, claude-opus-4-7, gpt-5.5, and gemini-3.1-pro via the @cursor_ai SDK:
- Description and body are two different surfaces. The router only sees the description. The agent sees the body once activated. They can quietly disagree, and only end-to-end task tests catch it.
- Aggregate accuracy is the wrong unit. When I rewrote three descriptions, the corpus average moved ~1pp. Individual skills moved 23–25pp. Per-skill effect size is where the action is.
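The arithmetic behind the second point can be illustrated with made-up numbers. The corpus size of 72 skills and the exact per-skill deltas below are assumptions chosen only so that three skills moving 23 to 25pp produce a corpus average of about 1pp; they are not the repo's actual measurements.

```python
# Hypothetical per-skill accuracy deltas in percentage points (pp).
deltas = {f"skill_{i}": 0.0 for i in range(72)}
deltas.update({"skill_0": 23.0, "skill_1": 24.0, "skill_2": 25.0})

# The aggregate view: one number that hides where the change happened.
corpus_mean = sum(deltas.values()) / len(deltas)

# The per-skill view: rank skills by absolute effect size.
top_movers = sorted(deltas, key=lambda name: abs(deltas[name]), reverse=True)[:3]
```

Under these assumptions the corpus mean moves by 1.0pp while the three rewritten skills account for the entire change, which is why reporting per-skill effect sizes is more informative than the average.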

Also, in Feb 2026 I shared a piece called Personal Brain OS arguing that the markdown file is a first-class substrate for agent state. SkillOpt is the optimizer-shaped version of that same argument: not "store memory in files" but "treat files as trainable parameters with proper optimization machinery around them." That's the move from static to measured.

The fast/slow split they describe already lives implicitly in the digital-brain-skill repo:
- voice-guide and tone-of-voice.md are slow-state (rarely touched)
- posts.jsonl and bookmarks.jsonl are fast-state

What SkillOpt adds that I didn't have is a protected section invariant, a structural guarantee that fast edits cannot overwrite slow lessons. Removing that mechanism cost them 22 points on SpreadsheetBench. Worth borrowing.
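A protected-section check might look like the sketch below. The HTML-comment markers and the function names are hypothetical, not the paper's format; the idea is only that an edit is rejected whenever the text inside a protected block differs from the original.

```python
import re
from typing import List

# Hypothetical markers delimiting slow-state content inside a skill file.
PROTECTED = re.compile(
    r"<!-- protected:start -->(.*?)<!-- protected:end -->", re.DOTALL
)


def protected_blocks(text: str) -> List[str]:
    """Extract the contents of every protected block, in order."""
    return PROTECTED.findall(text)


def preserves_protected(old: str, new: str) -> bool:
    """True if a proposed edit leaves all protected blocks byte-for-byte intact."""
    return protected_blocks(old) == protected_blocks(new)


skill = (
    "intro\n"
    "<!-- protected:start -->\nnever delete rows\n<!-- protected:end -->\n"
    "fast notes v1"
)
fast_edit = skill.replace("fast notes v1", "fast notes v2")
bad_edit = skill.replace("never delete rows", "delete rows freely")
```

Here `fast_edit` would pass the check and `bad_edit` would be rejected, so fast-state changes cannot overwrite the slow-state lessons.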

If you're building agents, SkillOpt: Executive Strategy for Self-Evolving Agent Skills is a good paper to read: https://t.co/ZS9SZXQ6Mv

56

2K

243

5K

773K

28 days ago

@arpit_bhayani Thanks. Will be there around July-Aug.

0

1

0

0

137

about 1 month ago

0

0

0

0

13

about 1 month ago

@arpit_bhayani Thank you for sharing. Fascinating Redis internals.

0

0

0

0

4

about 1 month ago

@arpit_bhayani Happy Birthday 🎂

0

0

0

0

24

about 1 month ago

Just realised I joined Twitter/X before Elon. It's been so long.

0

1

0

0

35

about 1 month ago

@arpit_bhayani Thanks for sharing. Redis is such an amazing design. Will watch.

0

1

0

0

366

about 1 month ago

@arpit_bhayani Yes, facing this every day. When you scale agents, you fall back to the basics of system design: queues, DLQs, async comms between agents, Redis, agent failures and restarts, checkpoints. Managing this at scale requires fundamentals. It's always core CS.

0

1

0

0

161

about 1 month ago

Server-sent events (SSE) streaming is just magic.

0

1

0

0

36

about 1 month ago

@adamgordonbell What the ... Crazy

0

0

0

0

60

about 2 months ago

@opentelemetry has such a brilliant design. While tracing agents, it was an absolute treat to see the beauty of its abstractions, simplicity, and overall architecture. @Microsoft’s Agent Framework implements it in such an elegant way with built-in OTel trace emissions. Makes tracing and visualizing in @grafana via @Azure incredibly smooth. I’m completely blown away! 🔥

0

0

0

0

19

2 months ago

Ran into the Boyer-Moore algorithm paper today while searching for a faster string search algorithm. And whoa, it is implemented in grep!

0

1

0

0

63

2 months ago

@Tigularius Hey!

0

0

0

0

2

2 months ago

@RusUslada Hello!!

0

1

0

0

9

2 months ago

0

1

0

0

20

2 months ago

1

1

0

0

132

2 months ago

@birds_justice Really, it's like suddenly I can read the minds of many people and see their thoughts.

0

0

0

0

37
