Large language models spontaneously develop the same specialized brain regions humans have for language, math, physics, and social reasoning. No one designed this. It just emerged.
Two completely different optimization processes (biological evolution vs. gradient descent) independently arrived at the same solution.
(1/N) autoresearch 🤝 weather forecasting - a thread 🧵
Can an automatic research loop improve a real weather dynamical core by making physics-informed changes? TBH we weren’t expecting much, but the early results were surprising enough for us to share:
A tricky LLM interview question:
You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces.
So you add KV cache compression and evict 90% of the cached tokens.
VRAM usage stays the same and the GPU still runs out of memory.
Why?
(answer below)
Evicting 90% of the KV cache can free almost none of the memory it was using.
This sounds counterintuitive, but it follows directly from how production servers store the cache today.
The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues.
This is the dominant memory cost for reasoning models.
A 32K-token CoT needs ~32K tokens of KV vectors cached, yet Qwen3-32B with 4-bit weights runs out of memory at around 24K tokens on a 24GB GPU.
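A rough back-of-the-envelope sketch of that growth. The config values (64 layers, 8 KV heads, head dim 128, fp16 cache) are my assumptions for a Qwen3-32B-like model, not numbers from this thread, so swap in the real config before trusting the output:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes appended to the KV cache per token: keys + values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(num_tokens: int, **cfg) -> float:
    """Total KV cache size in GiB for a sequence of `num_tokens` tokens."""
    return num_tokens * kv_bytes_per_token(**cfg) / 2**30


# ASSUMPTION: Qwen3-32B-like shape with an fp16 cache (not taken from the thread).
ASSUMED_CFG = dict(num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

if __name__ == "__main__":
    for n in (8_000, 24_000, 32_000):
        print(f"{n:>6} tokens -> {kv_cache_gib(n, **ASSUMED_CFG):.2f} GiB of KV cache")
```

Under those assumptions each token costs 256 KiB, so 24K tokens come to just under 6 GiB. That sits on top of roughly 16 GB for 32B parameters at 4 bits, which leaves very little headroom on a 24GB card.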
One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it.
But this does not solve the memory problem yet.
The reason is paged attention, which is the memory manager behind vLLM and most production servers.
Under the hood, it splits GPU memory into fixed-size physical blocks, each holding the KV for about 16 tokens.
A block returns to the allocator only when every slot inside it is empty.
The eviction logic selects tokens by importance, and the tokens it keeps are scattered across blocks...
...so despite eviction, almost every block is left with at least one survivor token.
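For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, each block keeps about 2 of its 16 tokens on average, so the large majority of blocks still hold at least one survivor.
This means the allocator gets back far fewer blocks than the eviction rate suggests.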
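You can sanity-check this with a tiny simulation. Big assumption: eviction here is uniformly random, which real importance-based selection is not, so treat it as a sketch of the block-granularity effect rather than a model of any actual eviction policy. The function name and parameters are made up for illustration:

```python
import random


def blocks_freed(num_tokens: int, block_size: int, num_evicted: int, seed: int = 0) -> int:
    """Count blocks that end up completely empty after evicting tokens at random.

    ASSUMPTION: tokens are evicted uniformly at random (a stand-in for an
    importance-based policy) and num_tokens is a multiple of block_size.
    """
    rng = random.Random(seed)
    evicted = set(rng.sample(range(num_tokens), num_evicted))
    num_blocks = num_tokens // block_size
    freed = 0
    for b in range(num_blocks):
        slots = range(b * block_size, (b + 1) * block_size)
        # A block goes back to the allocator only if every slot in it was evicted.
        if all(s in evicted for s in slots):
            freed += 1
    return freed


if __name__ == "__main__":
    freed = blocks_freed(num_tokens=16_000, block_size=16, num_evicted=14_000)
    print(f"evicted 87.5% of tokens, freed {freed} of 1000 blocks")
```

Under the uniform assumption a block empties with probability of about 0.875^16 ≈ 0.12, so evicting 87.5% of the tokens returns only around an eighth of the blocks. The freed-block fraction lands far below the evicted-token fraction.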
Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout.
Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order.
Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds.
This introduces another bookkeeping cost that an in-order layout inherently avoids.
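A toy sketch of that "separate note". It assumes keys are cached before a RoPE-style rotation and rotated at read time using a position stored in each slot. The 2-D rotation, `theta`, and all names are illustrative assumptions, not how any particular server lays out its cache:

```python
import math


def rotate(vec, pos, theta=0.1):
    """2-D RoPE-style rotation of `vec` by an angle proportional to `pos`."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (vec[0] * c - vec[1] * s, vec[0] * s + vec[1] * c)


def attend(query, q_pos, slots):
    """Single-query softmax attention over cache slots.

    Each slot is (position, raw_key, value). The key is rotated by the
    position stored in the slot, not by where the slot sits in the list.
    """
    q = rotate(query, q_pos)
    scores = []
    for pos, key, _ in slots:
        k = rotate(key, pos)
        scores.append(q[0] * k[0] + q[1] * k[1])
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, (_, _, v) in zip(weights, slots)) / sum(weights)


k_a, k_b, k_c = (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)
token_order = [(38, k_a, 1.0), (41, k_b, 2.0), (16001, k_c, 3.0)]
# Physical layout after token 16,001 reuses a freed slot between 38 and 41.
physical = [(38, k_a, 1.0), (16001, k_c, 3.0), (41, k_b, 2.0)]
# Same physical layout, but positions guessed from slot order (no per-slot note).
guessed = [(38, k_a, 1.0), (41, k_c, 3.0), (16001, k_b, 2.0)]

query, q_pos = (1.0, 0.0), 16002
out_ordered = attend(query, q_pos, token_order)
out_physical = attend(query, q_pos, physical)
out_guessed = attend(query, q_pos, guessed)
```

With the position carried in each slot, the out-of-order layout gives the same output as the in-order one. Guess the position from slot order instead and the result drifts, which is why the extra per-slot metadata is needed.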
So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.
There's another problem.
Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected).
But fast attention kernels used in production, like FlashAttention, never save those scores.
They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast.
So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.
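To make the "small pieces" part concrete, here is a minimal pure-Python sketch of the online-softmax idea for one query. It is not FlashAttention's actual kernel (no GPU tiling, scalar values, made-up function names), just the running-max and running-sum trick that lets you skip storing the full score row:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Eager attention: materializes every score, so eviction could inspect them."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w), scores


def tiled_attention(q, keys, values, tile=2):
    """Online softmax: only one tile of scores exists at any time."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running weighted sum of values
    for start in range(0, len(keys), tile):
        tile_scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(tile_scores))
        # Rescale what was accumulated so far to the new max (exp(-inf) is 0.0).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(tile_scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max  # tile_scores is dropped after this iteration
    return acc / denom


q = (0.3, -1.2)
keys = [(1.0, 0.0), (0.5, 0.5), (-1.0, 2.0), (0.2, 0.1), (3.0, -1.0)]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
eager_out, full_scores = naive_attention(q, keys, values)
tiled_out = tiled_attention(q, keys, values)
```

Both paths return the same output, but only the eager one hands back `full_scores`. That is the signal score-based eviction wants, and the tiled version never has it all in memory at once.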
NVIDIA published a method called TriAttention to solve both these problems.
It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.
For the memory problem, it runs a compaction pass every 128 decoded tokens.
The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.
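A minimal sketch of what a compaction pass does to the block layout. This is a hypothetical illustration on a flat Python list, not the method's real implementation; `compact` and the `None`-for-hole convention are my own:

```python
def compact(cache, block_size):
    """Slide surviving tokens forward so whole blocks become empty.

    `cache` is a flat list of slots: a token position (int) for a survivor,
    or None for a hole left by eviction. Its length is assumed to be a
    multiple of `block_size`. Returns (compacted_cache, freed_block_count).
    """
    survivors = [t for t in cache if t is not None]      # token order preserved
    blocks_before = len(cache) // block_size
    blocks_after = -(-len(survivors) // block_size)      # ceiling division
    padding = blocks_after * block_size - len(survivors)
    return survivors + [None] * padding, blocks_before - blocks_after


# Four blocks of four slots, one survivor stranded in each block.
fragmented = [0, None, None, None,
              None, 5, None, None,
              None, None, 10, None,
              None, None, None, 15]
compacted, freed = compact(fragmented, block_size=4)
```

Before the pass, no block is empty, so nothing can be freed. After it, the four survivors share one block in their original order and the other three blocks can go back to the allocator.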
On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.
KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.
You can find the NVIDIA write-up here: https://t.co/ZwXv7VezVu
I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching.
Read it below.
Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API.
Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls.
Try it: https://t.co/hhO6qTawgb 🐡
Karpathy's prediction about RL is coming true now!
He called reward functions unreliable and argued that a single reward number is too low-dimensional to teach an agent what "good" means for complex tasks. To solve this, agents need a knowledge-guided review as a higher-dimensional feedback channel.
Every major AI lab trains models with RL today (OpenAI, Anthropic, DeepSeek).
And their key bottleneck has always been the reward functions.
GRPO by DeepSeek worked well for math and code because the environment gave a binary signal.
But for real agent tasks, someone still has to hand-code the scoring function. That takes days and breaks every time the pipeline changes.
RULER (implemented in OpenPipe ART, 10k stars) addresses the exact problem Karpathy identified.
The reward criteria are defined in plain English, and an LLM evaluates each trajectory against that description to provide feedback for training.
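Here is a rough sketch of how judge scores can feed GRPO-style training. The judge below is a deterministic stub I made up so the example runs offline (a real setup would call an LLM with the rubric), and none of these names come from ART's API. The advantage formula is the standard group-relative normalization:

```python
import statistics

# Plain-English rubric; in a real setup this text goes to the LLM judge.
RUBRIC = "Reward reaching higher tiles in 2048. Penalize games that end early."


def stub_judge(trajectory: dict, rubric: str) -> float:
    """HYPOTHETICAL stand-in for an LLM judge: returns a score in [0, 1]."""
    return min(trajectory["max_tile"] / 2048, 1.0)


def group_relative_advantages(scores, eps=1e-6):
    """GRPO-style advantages: each score normalized against its own group."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


# One group = several rollouts of the same prompt (illustrative data).
group = [{"max_tile": 128}, {"max_tile": 512}, {"max_tile": 256}, {"max_tile": 1024}]
scores = [stub_judge(t, RUBRIC) for t in group]
advantages = group_relative_advantages(scores)
```

Because the advantages are relative within the group, the judge only needs to rank rollouts consistently. It does not need a calibrated absolute score, and no separate critic model is involved.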
I trained a Qwen3 1.4B agent that plays 2048 using GRPO with this exact workflow.
In this case, the agent saw the board, picked a direction, and RULER evaluated the outcome, all from this natural language definition.
You can see the full implementation on GitHub and try it yourself.
Here's the ART Repo: https://t.co/XeTppNyX9p
(don't forget to star it ⭐ )
Just like RLHF replaced manual rankings and GRPO replaced the critic model, natural language rewards are replacing hand-coded scoring functions.
RL reward engineering is now prompt engineering.
I wrote a full walkthrough on OpenPipe's ART, the agent RL trainer built on GRPO, including how RULER replaces manual reward engineering with automatic LLM-graded rewards.
The article is quoted below.
MIT just open-sourced a model that could end the $150/hour CAD industry.
It’s called GenCAD. It converts photos into fully editable CAD programs. Just upload a sketch or photo and it generates the full parametric 3D model.
100% Open Source.
“FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention”
Long-context LLMs are bottlenecked by the KV cache, because every old token keeps consuming GPU memory even when most of it is irrelevant. This paper turns long context into retrieval.
It uses a small Memory Indexer to predict which old KV chunks the model will need soon, keeps only those on the GPU, and leaves the rest offloaded.
The result is an average KV cache footprint of 13.5%, with up to 90% memory reduction at 500K context and slightly better accuracy than DS V4 Flash.