Very good advice on self-improving agents.
(bookmark it)
This is something I am seeing in my own experiments with coding agents and harnesses for long-horizon tasks.
What I have found is that stronger models do not always evolve better agents.
The current believe in self-evolving agents is that a bigger model writes better prompt and skill edits, so devs put their best model in the evolver seat.
New research shows that intuition is mostly wrong.
The work separates two abilities that usually get conflated. Producing harness updates stays flat across model capability, so Qwen3.5-9B writes edits roughly as good as Claude Opus 4.6. Benefiting from those updates follows an inverted-U that peaks at mid-tier models, while weak models fail to even activate the edits and strong models have little headroom left.
This is important to understand as it tells you where to spend. Put a cheap model on the evolver and your expensive model on the solver, because the gains land solver-side, not evolver-side.
Paper: https://t.co/8kJwR7NhmV
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
I've got an agent in a loop optimizing a renderer with the goal to minimize frame times (and tests to measure). It got times down from 88ms to 2ms and allocations down from ~150K to 500. Sounds good, right? Wrong. This is exactly why agent psychosis is a big fucking problem.
As an experiment, I rewrote the Ghostty core render state in Go, with access to identically laid out data structures as Ghostty and the exact same validation tests. I made a purposely naive renderer (simple, correct, but slow). 88ms per frame with 150,000 allocations (horrendous, lol)!
I then kickstarted a Ralph loop to bring the frame times down. I told it it can't modify input data structures or the public API or tests (they're correct), but it can do anything else it wants. It got to work.
It has worked for about 4 hours. I've spent around $350 on this experiment so far. The results?
88ms => 1.5ms
150K allocs => ~500 allocs
Incredible right? Nope.
My hand-written renderer I ported has frame times (same benchmark) of ~20us (0.020ms) and 0 allocations in the update path.
This is the problem with psychosis and lacking systems understanding. If you don't understand the system, you're going to accept that this is an incredible result. If you understand the system, you'll see better solutions immediately and can do roughly 75x better on throughput.
The people who blindly trust agent output are in the former camp. They're sheeple, overdrinking from a fountain of mediocrity.
Standard disclaimer: I use AI all the time. I like AI. The point I'm making is to not blindly accept results. Think. Analyze. Learn.
There are two loops in every founder's head.
The autism loop: run your own model to the floor, ignore consensus, hold a thesis when everyone says you're wrong. That makes conviction.
The empathy loop: feel what the user feels, sense what the market wants before it has words. That makes traction.
Most people crank one and starve the other. Pure conviction builds something brilliant nobody wants. Pure empathy builds consensus mush.
PG put the whole job in four words: make something people want. The autism loop makes the something. The empathy loop knows it's wanted. The founder is the bridge.
Most great founders show up dominant in the first loop. That's why they're contrarian enough to try at all. The work is grafting on the second.
There is no place in the world that helps founders make the two loops work together to make great startups than Y Combinator. It is the most gratifying part of our work.
Footage comparing before and after shows the dramatic progress made by Gath, a Parkinson’s patient diagnosed 12 years ago, after only two days using the Produodopa treatment.
"I think that typing code manually is dying."
- @shanselman on The Agentic Review podcast
We got into what that means and much more on the last episode. If you're thinking about where the craft is actually heading, this one's worth your time.
We built a bipedal robot for about $2,500.
A real, mostly 3D-printed robot you can build, repair, simulate, train, and control.
Today we’re releasing LeRobot Humanoid: an open robot-learning platform with hardware, runtime, identification tools, and training environments.
Blog post: https://t.co/zu2etb1NZo
Repo: https://t.co/4myLRUtZ3W
Never mind about the Rolex… had no idea buying one was like buying a Ferrari, where you have to beg, jump though hoops and then buy products you don’t want to get the one you do want! 😂
thought it would be fun to own, but would much prefer to just order something on a website (like a Tesla!) and be done with it.
These retail and reseller channels make things far too time consuming
Is there a nice watch I can just… order?
We’re training models wrong and it’s due to chatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn.
This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.
In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵Be created by instruction-tuning for the stream format
🔵Simplify user and tool use UX removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency
🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security
🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.
Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.
The quantity of code that devs ship has roughly 10xed. But net developer productivity (value created by unit of time) is only up by a bit, if at all. Part of it is that the additional code is solving more incremental problems. A bigger part is that the new code is creating problems of its own.
I’m so glad AI killed LeetCode interviews.
For 10 years, tech companies made every engineer grind the same puzzles and prove they could invert a binary tree from memory.
Today, the dumbest AI model can walk in and one-shot the entire interview.
Thank you, AI.
Burkina Faso again, showing us how good infrastructure can be when you use local materials intelligently.
The Koudougou Central Market won the Aga Khan Award for Architecture. 29,000 square metres. 1,800 vendors. Primary construction material: compressed earth blocks sourced from two kilometres away. The vaulted arches were built by local enterprises with specialized earth construction expertise.
The layout does two things simultaneously; maximum commercial density and maximum natural ventilation. Two overlapping grids create shade between neighboring structures while channeling cross breezes through the alleys. The market cools itself. In Burkina Faso’s extreme heat, that’s not a design gesture, it’s survival infrastructure.
The deeper story is economic. Local materials meant local labour, earth extraction, brick production, vault construction, all community income. A skilled workforce was trained and retained. Demand for earth construction in the region grew long after the build ended.
Then the market opened. 1,800 vendors with dignified, shaded, ventilated trading spaces. Better conditions mean longer trading hours, stable livelihoods and a neighbourhood that actually prospers. A market that works climatically is a market that works economically.
📍Koudougou Central Market, Koudougou, Burkina Faso. Laurent Séchaud / SDC.
@GoogleDeepMind Good way to indicate where my attention is without having to explain every little detail.
Good way to indicate to Gemini where I want to focus its attention for its actions.
We’re reimagining a 50-year-old interface - the mouse pointer - with AI. 🖱️
These experimental demos show how people can intuitively direct Gemini on their screens using motion, speech, and natural shorthand to get things done 🧵
@eladgil BS.
Attention was born in Montréal
PyTorch in NYC.
AlphaGo in London
AlphaFold in London
ESMFold in NYC
Llama 1 in Paris.
Llama 2 in Paris+NYC+SV
DeepSeek in Hangzhou
Plus:
DINO in Paris
JEPA in Montréal+Paris+NYC
SV is 3 mos ahead on topics SV is singularly obsessed with.