"HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization"
This paper shows that the right unit for hybrid attention is the head, not the layer.
So you have a few retrieval-critical heads to keep Full Attention, while the rest switch to Gated DeltaNet Linear Attention. The FA heads are chosen with causal patching, then FA and LA outputs are fused with scale normalization.
This gives a 7 to 1 LA to FA model that matches a heavier 3 to 1 layer-wise hybrid, which improves 512K retrieval by over 69% with only 15B training tokens, and preserves reasoning.
Another important insight here is that they turned interpretability from an analysis tool into an architecture design tool.
As they didn't just use causal analysis to explain what heads do, it also uses that analysis to decide which heads deserve Full Attention and which can be replaced with Linear Attention for efficient 512K-context scaling.
how a simple llm wiki compares to gbrain
second brains are getting popular fast, they're one of the main enablers for ai and agents right now. the context you give an agent is what makes it good
two frameworks I've been using are the LLM Wiki and Gbrain. here's how they compare, and how to use both
underneath they're the same idea, karpathy's llm wiki: you compile raw sources into linked markdown pages your agent reads, instead of redoing RAG from scratch every time
both ingest your sources, build a graph out of them, and answer with citations, so the real question is what's actually different
an LLM wiki is just markdown and your agent:
> it reads your sources and writes linked pages
> you ask a question and it reads those pages to answer
> you keep it healthy with a lint pass
> there's no database, just files, and one user
it works well, and karpathy even points out where it starts to break down:
> the synthesis drifts after a lot of updates
> the context cost grows as the wiki gets big
> a wrong claim can harden into fact over time
gbrain is that same wiki with an engine built for those exact problems:
> better retrieval, vector plus graph plus a reranker, instead of the agent reading pages
> it runs on postgres, so it scales past what you could ever read yourself
> a 24/7 loop enriches and fixes the wiki on its own, so there's no manual lint
> every answer comes with sources and an honest note on what it doesn't know yet
> it's multi-user, with access scoped per person and team
when to reach for each:
> use an llm wiki for smaller projects, to gather and store the context an agent will use later on. when it grows up, you can ingest it straight into gbrain
> use gbrain for the consistent, shared things, a company brain or a client brain, especially once more people are involved
so it's not wiki vs brain, it's the same wiki run by you on a small project, versus the same wiki run by an engine at scale for a team
start simple, then move to gbrain when you outgrow the files
Anthropic wrote it down: prompt engineering was just the start.
Next comes context engineering. The discipline of curating what the model sees.
Their words: bloated context is the silent killer of agent reliability.
It isn't the model. It's what you feed it.
Same Claude for everyone. The gap is the operator.
Below: the 4 levels that close it.
Here is a condensed, high-impact version of the post:
Discover -> Hand off -> Verify -> Persist -> Schedule
This 12-page PDF completely changed how I build agentic systems.
Here is the 5-step blueprint:
Discovery: The loop reads CI, issues, and commits to find what's worth fixing.
Handoff: Each finding gets an isolated git worktree so parallel agents never collide.
Verification: A second agent - built to assume the code is broken, reviews the work.
Persistence: Results land on disk, never in a temporary context window.
Scheduling: Automation fires the entire process on a timer, making it a true loop.
The key insight: An agent grading its own work always praises it. You need that second agent to say "no."
Read it now, then explore the article below.
My entire AI stack is now Chinese 🇨🇳
87% cheaper. same revenue
swaps by task:
1. reasoning / backend brain
Opus 4.8 → Kimi K2.7
benchmark gap: ~8% · price: ~11x cheaper
2. code generation
GPT-5.5 → Qwen 3.7 Max
benchmark gap: ~18% · price: ~7x cheaper
3. agent loops + tool calling
Sonnet 4.7 → GLM 5.2
benchmark gap: ~3% · price: ~5x cheaper on input
4. cheap volume / bulk processing
GPT-5.5 mini → MiMo V2.5
benchmark gap: ~6% · price: ~12x cheaper
5. image generation
GPT-Image-2 → Wan 2.5
benchmark gap: ~5% · price: ~8x cheaper
6. video generation
Sora 2 → Kling 3.0
benchmark gap: roughly equal · price: ~6x cheaper
[ result after 30 days: ]
operating costs dropped 87%, output quality dropped 4% on average, revenue unchanged
the most important that these models will be not banned in a month and i can run them locally
nobody will steal my data and i can learn them as i need
full article drops tomorrow with:
> exact routing logic per task type
> the 2 cases where I still pay for American
> the migration playbook anyone can copy in a weekend
VERY IMPORTANT to get migrated now, while it's not too late
This paper completely changed how I think about trusting retrieval in RAG:
Fetch documents -> Score their quality -> Get a confidence -> Pick an action -> Clean the context -> Generate
Here is the 5-step blueprint:
Retrieval evaluator: a lightweight model scores the quality of the fetched docs for the query and outputs a confidence degree.
Three actions: confidence triggers one of {Correct, Incorrect, Ambiguous}, instead of blindly stuffing everything in.
Web search on failure: if docs are bad, the query is rewritten and knowledge is pulled from large-scale web search rather than a static corpus.
Decompose-then-recompose: each document is split into minimal strips, relevant ones are kept, noise is dropped, the context is rebuilt.
Plug-and-play: all of this bolts onto plain RAG and onto Self-RAG with no retraining of the generator.
Key insight: the RAG problem is not only when to retrieve, but what to do when retrieval comes back wrong.
One lightweight evaluator with three actions lifts both plain RAG and SOTA Self-RAG across four datasets at once.
Read this, then check the article below.
If you build with MCPs, this one is worth reading.
(bookmark it)
The paper covers five recurring MCP server patterns across fifteen independently developed servers.
That taxonomy is useful because I see many AI teams rebuilding the same shapes without shared names.
If you are building MCP servers, this is a practical reference for deciding whether your server is exposing resources, orchestrating tools, managing sessions, aggregating proxies, or adapting a domain workflow.
Paper: https://t.co/yA6mxq2NEQ
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
this is f*cking gold
Andrej Karpathy joined Anthropic five weeks ago.
A friend on his team just showed me the exact LOOPS.md file he actually uses.
I dropped it into my setup. The very first response was different.
Not slightly different. Completely different.
Claude stopped giving generic answers and started working exactly the way I think.
You don't talk to the model anymore. You build the system that talks to the model for you.
Bookmark it before it gets lost in your feed.
Read it now, then check the article below.
As an AI Engineer. Please learn
>Harness engineering, not just prompt engineering
>Context engineering, not just long prompts
>Prompt caching vs. semantic caching tradeoffs
>KV cache management, eviction, reuse, and memory pressure at scale
>Prefill vs. decode latency and why they optimize differently
>Continuous batching, paged attention, and throughput optimization
>Speculative decoding vs. quantization vs. distillation tradeoffs
>INT8, INT4, FP8, AWQ, GPTQ, and when quantization hurts quality
>Structured output failures, schema validation, repair loops, and fallback chains
>Function calling reliability, tool contracts, argument validation, and idempotency
>Agent guardrails, loop budgets, tool budgets, and termination conditions
>Model routing, graceful fallback logic, and degraded-mode UX
>RAG architecture: chunking, embeddings, hybrid search, reranking, and freshness
>Retrieval evals: recall, precision, grounding, attribution, and citation quality
>Evals: golden sets, regression tests, adversarial tests, LLM-as-judge, and human evals
>LLM observability as a first-class discipline: traces, spans, tokens, latency, errors, and drift
>Cost attribution per feature, workflow, tenant, and user journey not just per model
>Safety engineering: prompt injection defense, data leakage prevention, and permission boundaries
>Multi-tenant isolation, cache safety, and cross-user context contamination prevention
>Fine-tuning vs. in-context learning vs. RAG vs. distillation and when each is the wrong tool
>Latency, quality, cost, and reliability tradeoffs across the full inference stack
>Production failure modes: hallucinated tool calls, malformed JSON, stale retrieval, runaway agents, and silent eval regressions
"Are We Ready For An Agent-Native Memory System?"
Agents do not just need longer context windows. They need a real memory system, as agent memory now is more like a data management layer, not a retrieval problem.
This paper breaks memory into representation and storage, extraction, retrieval and routing, and maintenance, then evaluates 12 memory systems across 11 datasets.
They found that no memory architecture wins across the board. Graph memory is strong for updated facts and entity relations, hybrid systems do better for filtered recall, long context still helps when chronology matters, and append-only memory mostly returns facts.
🚨 A SENIOR ANTHROPIC ENGINEER JUST DROPPED AN 11-PAGE PDF ON LOOP ENGINEERING.
The core shift: stop prompting the agent. Build the system that prompts it.
Inside the autonomous loop:
- Discover → Finds its own work (failing CI, open issues).
- Isolate → Uses separate git worktrees to prevent collisions.
- Verify → A second agent reviews the work. (Never let agents self-grade).
- Persist → Writes to disk, not temporary context windows.
- Schedule → Runs automatically on a timer.
This is a great framework for building more reliable agentic systems
link to the guide below.
Read it, then check out this ace article on Loop Engineering by @akshay_pachaar 👇
A tricky LLM interview question:
You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces.
So you add KV cache compression and evict 90% of the cached tokens.
VRAM usage stays as is and GPU still runs out of memory.
Why?
(answer below)
Evicting 90% of the KV cache can free almost none of the memory it was using.
This sounds counterintuitive, but it follows directly from how production servers store the cache today.
The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues.
This is the dominant memory cost for reasoning models.
If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU.
One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it.
But this does not solve the memory problem yet.
The reason is paged attention, which is the memory manager behind vLLM and most production servers.
Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens.
This block returns to the allocator only when every slot inside it is empty.
Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks...
...so despite eviction, almost every block is left with at least some survivor tokens.
For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token.
This means the allocator frees almost nothing.
Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout.
Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order.
Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds.
This introduces another bookkeeping cost that an in-order layout inherently avoids.
So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.
There's another problem.
Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected).
But fast attention kernels used in production, like FlashAttention, never save those scores.
They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast.
So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.
NVIDIA published a method called TriAttention to solve both these problems.
It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.
For the memory problem, it runs a compaction pass every 128 decoded tokens.
The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.
On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.
KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.
You can find the NVIDIA write-up here: https://t.co/ZwXv7VezVu
I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching.
Read it below.
// Critique of the Agent Model //
Finally, a paper that tries to define what an agent is and what agency consists of.
Good read overall. (great bookmark)
The word agent now covers everything from a for-loop with tool calls to speculative machine superintelligence.
Eric Xing and colleagues ask where automation ends, and agency begins.
Drawing on Descartes and on science-fiction portrayals of autonomous beings, they analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning.
The argument is that genuine agency requires these structures to hold together in a specific way. Great paper overall, providing a vocabulary for arguing about what is and is not an agent.
Paper: https://t.co/qFvMxWd5cq
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
this is f*cking dangerous
someone just open sourced the entire "LOOP ENGINEERING" framework for free
build a hedge fund printing alpha 24/7 by feeding it into claude code with my article below
bookmark before someone takes it down
A senior Anthropic engineer just dropped 11-page PDF on "Loop Engineering" for agentic systems.
The shift: you stop prompting the agent. You build the system that prompts it instead.
Schedule → Discover → Build → Verify → Repeat
Every loop runs one turn, five moves:
• Discovery: it finds its own work - failing CI, open issues, recent commits - instead of being handed a list.
• Handoff: each task gets an isolated git worktree so parallel agents don't collide.
• Verification: a second agent, told to assume the code is broken, reviews the first. The "thing that can say no."
• Persistence: results get written to disk, never left in a context window that gets flushed.
• Scheduling: an automation wakes it on a timer. That's what makes it a loop.
The key insight: an agent grading its own work always praises it.
This 11-page PDF changed how I'm building agentic systems today.
Read it now, then explore the article below.
A senior Anthropic engineer just published the clearest blueprint on "How to give your AI agent a real memory" and it's a 15-page PDF.
Write → Consolidate → Recall → Apply
• Write: after every attempt, the agent records what it tried and what happened.
• Consolidate: it distills those raw attempts into a few reusable lessons, not a transcript dump.
• Recall: before the next task, it reads those lessons first.
• Apply: it skips the dead ends it already learned, even on a brand new problem.
This is exactly how engineers now build agent loops in Claude Code.
Read the paper, then grab the setup below 👇
A senior Google engineer just dropped a 19-page PDF on "Loop Engineering" for LLM and agentic systems.
Act → Observe → Learn → Repeat
• Act: the LLM proposes a code transformation (tile this loop, parallelize that one).
• Observe: a compiler runs it and reports back - is it valid? faster? slower? by how much?
• Learn: the LLM reads that feedback and adjusts its next move.
• Repeat until it stops finding improvements.
The agent gets smarter purely from grounded feedback inside its own context window.
This 19-page PDF totally changed the way I’m building agentic systems today.
Read it now, then explore the article below.