4/ Retrieval moves inside the context. The product question becomes which parts of the billion tokens the agent should attend to for which task.
Full piece: https://t.co/C4YsPnTW7M
1/ 🧠 A 1M token context window changed what an agent could do in one call.
When a 1B token window arrives at scale, it will change what an agent can hold in its head.
3/ RAG was designed for a scarce window. At a billion tokens, the chunk boundary becomes the bug.
Carrying context beats retrieving it once cost-per-token falls.
Zerg builds AI systems for mission-critical work. The retry loop is our default. On hard problems, the default is wrong.
We don't have a memory layer like this in our pipeline yet. That's exactly why this paper jumped out.
Read our full thoughts: https://t.co/A68UsjjRwz
Most AI coding agents write code by retrying — same broken approach, slightly rephrased.
A new paper from CMU, U Washington, and Arm asked the obvious question: What if the agent actually *remembered* what went wrong?
🧵
On KernelBench Level-2: 3.12x speedup over PyTorch eager within 100 steps.
On FlashInfer-Bench RMSNorm: 1.75x faster than expert-written CUDA.
Not a naive baseline. The kernel teams actually deploy.
AI is changing how we build, reducing costs and development time.
Read our take on the changing environment in our newest blog post: https://t.co/J7011pfuaR