@fofrAI@paulopacitti "I use Gemini 3 Pro to generate a JSON prompt because it has the best vision capabilities (at the moment)."
Does this stand still today, or there are new leaders in that regard ?
Progressive disclosure is not reliable because LLMs are inherently lazy.
"In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it."
Vercel ran evals on Next.js 16 APIs that aren't in model training data to test whether agents could learn framework-specific knowledge through Skills vs. persistent context.
Skills are the "correct" abstraction: package domain knowledge, let the agent invoke it when needed, minimal context. The agent decides when to retrieve.
They work well WHEN the user triggers them; otherwise, LLMs just ignore them.
Vercel's benchmarking is the first experiment of this kind I've seen, and it's actually interesting.
- Baseline (no docs): 53%
- Skill (default): 53%
- Skill with explicit instructions: 79%
- AGENTS[.]md with 8KB compressed docs index: 100%
The skill approach assumes agents reliably recognize when they need external knowledge and act on it. They don't.
"You MUST invoke the skill" made agents read docs first and miss project context. "Explore project first, then invoke" performed better. Same skill, different outcomes based on prompting.
The winning approach removed the decision entirely. An 8KB compressed index embedded in AGENTS[.]md, with one instruction: "Prefer retrieval-led reasoning over pre-training-led reasoning."
Two agent design learnings:
1. Passive context beats active retrieval for foundational knowledge. Don't make the agent decide to look things up, make the index always present.
2. Compress aggressively. Vercel went from 40KB to 8KB (80% reduction) with zero performance loss. The agent needs to know where to find docs, not have full content in context.
The gap between "agent can access X" and "agent will access X" is larger than we assume.
I keep seeing similar findings across agent architectures. Kimi Swarm's orchestrator is trained specifically to avoid sequential execution. Without training, orchestrators default to serial processing, planning a list of steps and executing them one by one. It's the EASY path.
The agent defaults to the lazy path: hallucinating from training data rather than retrieving docs. Passive context removes the choice entirely; the agent doesn't decide whether to look things up; the index is already there.
We keep finding that the "smarter", more autonomous design (let the agent decide when to X) underperforms the "dumber" design (always X, or structurally enforce X).
WOW!!! If you have semantic memory tied to your UserPromptSubmit hooks, you MUST ALSO include it in your PreToolUse hook. I promise you - it will be an absolute GAME CHANGER. It will put your efficiency levels are over 9,000 (*vegeta voice*).
How many times have you sat there, watching Claude code go through an extended workflow, just to notice it start to go down a path you just KNOW will be error filled - and subsequently take it forever to FINALLY figure it out?
The problem with relying strictly on the UserPromptSubmit hook for semantic memory injection is the workflow drift from your original prompt. The memories it injects at the initiation of your prompt will be less and less relevant to the workflow the longer the workflow is.
Claude has a beautiful thing called thinking blocks. These blocks are ripe for the picking - filled with meaning & intent - which is perfect for cosign similarly recall. Claude thinks to itself, "hmm, okay I'm going to do this because of this", then starts to engage the tool of its choice, and BOOM:
PreToolUse hook fires, takes the last 1,500 characters from the most recent thinking block from the active transcript, embeds it, pulls relevant memories from your vector database, and injects them to claude right before it starts using its tool (hooks are synchronous). This all happens in less than 500 milliseconds.
The result?
A self correcting Claude workflow.
Based on my testing thus far, this is one of the most consequential additions to my context management system I've implemented yet.
Photos: ASCII chart showing the workflow of the hook, and then two real use-cases of the mid-stream memory embedding actually being useful.
If you already have semantic memory setup, just paste this tweet and photos into Claude code and tell it to implement it for you. Then enjoy the massive increase of workflow efficiency :)
@shydev69 any write ups from their team on the reverse engineering of claude code ? curious about the internals of cc beyond just extracting the messages/tool calls from .jsonl sessions
Want to find out the Juice number (reasoning effort) for GPT-5-Thinking?
Try the prompt below. This is the equivalent of medium reasoning effort in the API
@sama@sama for love of god, please bump the input ctx window for Enterprise ChatGPT users, GPT 5 Thinking is somewhere around 50k tokens and GPT 5 (non thinking) is somehwere between 80-85k. Literally just tested few mins ago. PLEASE PLEASE PLEASE