I closed one chapter a few days ago: my last day at Google. Half a year of incredible research, surrounded by brilliant colleagues, an experience I’ll always treasure and recommend. Big Tech is safe, and safe is good. But when you’re young, too much safety means missing something vital.
What’s missing? The courage to go all in. The thrill of building 0 → 1.
So I packed my life, moved to San Francisco, and went all-in. I walked away from the safest paths, the big tech offer, the academic track, because if I never bet on myself, I’d regret it forever.
And now -- right at a moment in history when AI can change everything -- who could resist betting it all?
This is the next chapter: building the world’s greatest data infrastructure for ASI. This is bigger than me -- it’s a mission.
If you’re curious, want to support, or just want to chat -- DM me. And if I can help you in any way, my DMs are always open.
Let’s accelerate toward ASI together. 🚀
Fun fact: my last day wasn’t in South Bay, but in a SF office I’d never even been to before, because I was rushing to submit my ICLR paper🥲.
A lot of people have asked whether I’ll be at #ICLR 🇧🇷 this year. Sadly, I won’t make it in person. It has been an unusually busy stretch, and I ended up missing the trip.
Our CoDA is presenting at #ICLR now, and welcome to stop by and chat ☕️.
📅 Sat (today!), Apr 25, 2026, 10:30 AM – 1:00 PM (local time)
🏠 Pavilion 3, P3-#1602
While CoDA is presented in the context of scientific visualization, the core architectural ideas go far beyond that application.
What we really care about is a broader question: how agent systems can decompose complex tasks, collaborate across roles, and iteratively refine outputs until they become genuinely useful.
I’m also excited that the code is now publicly available through Google Research:
👩💻 https://t.co/GMpTfhJpk9
If you are thinking about multi agent systems, self-evolving, or harness, we would be very happy to discuss!
📝 Conf details: https://t.co/AfPkr7ySsK
📂 Project page: https://t.co/im1HznXbFc
With deep research revolutionizing research/data analysis, why are we still stuck in manually crafting data viz?
Meet CoDA (https://t.co/u7BsevqvHs): The ultimate multi-agent LLM powerhouse for auto-generating stunning plots from NL queries! Handles complex data, self-refines for perfection, & smashes baselines by 41.5%🚀
Key Features:
🌟Specialized agents for metadata analysis, planning, code gen/debug, & reflection
🌟Bypasses LLM input length limits w/ metadata focus
🌟Iterative loops for robust, human-like quality checks
🌟SOTA on MatplotBench & Qwen & DA-Code
#DataViz #AgenticAI #MultiAgent
Thank you for sharing our work!
Exciting direction!
What matters now is being able to measure whether agents are actually contributing to scientific and engineering progress, not just producing fluent outputs.
If research capable AI matters, evaluation has to be open, realistic, and community built.
Since launching #AutoLab, we’ve gotten a lot of inbound from researchers, builders, and friends.
What’s clear is this: the field wants a better standard for evaluating research-capable agents.
Our goal is simple: build a fair, open, transparent benchmark for agents that can operate in real scientific and engineering loops.
This should not be defined behind closed doors.
[Accidentally deleted this earlier, reposting] 😭
#AutoLab#autoresearch
We've been asking ourselves a question: if AI agents can now run hundreds of experiments overnight, how do we know whether they're actually contributing to research — or just generating noise?
That's why we built AutoLab (https://t.co/aRbV2YeaPf).
Not another pass/fail benchmark, but an open-source environment where agents face the same loop every researcher knows intimately — propose, test, fail, diagnose, revise, repeat.
23 tasks with no answer keys, just open search spaces and real constraints.
We ran 161 evaluations across 7 frontier models, 633M tokens. Every decision, every pivot, every dead end — all openly available in our Live Lab for anyone to replay and learn from.
What we found wasn't about which model is "smartest." It's about a capability we call closed-loop resilience: when incremental refinement stops working, can the agent recognize it and restructure? On one task, two frontier models hit the same wall. One kept pushing within the existing frame. The other stepped back and redesigned the approach entirely. That moment — knowing when to abandon a frame, not just optimize within it — is what separates real research from sophisticated pattern matching.
We believe this matters beyond benchmarking. If agents are genuinely entering the research loop, we want that transition to be measured transparently, built in the open, and shaped by the community — not locked inside any single lab. The scientist doesn't disappear. The loop gets a new participant. And we want to make sure that participant is understood.
This is a joint effort across @Stanford, @MIT, @UW, @UCSanDiego, @ucsantabarbara, @NotreDame, NUS, @Google, @NVIDIA, @IBMResearch, and @bakelab_hq. But 23 tasks is just the start. If you have an optimization problem you've spent weeks grinding on empirically — with a clear metric and no known optimal solution — it probably belongs here. Contribute a full task, a rough skeleton, or just the idea.
The best benchmarks aren't built by one team. They're built by the people who actually do the work!
Github: https://t.co/2sLNlASVcb
The more concentrated frontier capability becomes, the more important open evaluation becomes.
If powerful agentic systems are going to shape critical work, the field cannot rely only on selective access, internal safeguards, and closed reporting to understand what these systems can actually do.
We need public evaluation surfaces with open tasks, replayable runs, visible failures, and community scrutiny.
That is a big part of why we built AutoLab.
https://t.co/wiLLUw3Ww4
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software.
It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans.
https://t.co/NQ7IfEtYk7
[Accidentally deleted this earlier, reposting] 😭
#AutoLab#autoresearch
We've been asking ourselves a question: if AI agents can now run hundreds of experiments overnight, how do we know whether they're actually contributing to research — or just generating noise?
That's why we built AutoLab (https://t.co/aRbV2YeaPf).
Not another pass/fail benchmark, but an open-source environment where agents face the same loop every researcher knows intimately — propose, test, fail, diagnose, revise, repeat.
23 tasks with no answer keys, just open search spaces and real constraints.
We ran 161 evaluations across 7 frontier models, 633M tokens. Every decision, every pivot, every dead end — all openly available in our Live Lab for anyone to replay and learn from.
What we found wasn't about which model is "smartest." It's about a capability we call closed-loop resilience: when incremental refinement stops working, can the agent recognize it and restructure? On one task, two frontier models hit the same wall. One kept pushing within the existing frame. The other stepped back and redesigned the approach entirely. That moment — knowing when to abandon a frame, not just optimize within it — is what separates real research from sophisticated pattern matching.
We believe this matters beyond benchmarking. If agents are genuinely entering the research loop, we want that transition to be measured transparently, built in the open, and shaped by the community — not locked inside any single lab. The scientist doesn't disappear. The loop gets a new participant. And we want to make sure that participant is understood.
This is a joint effort across @Stanford, @MIT, @UW, @UCSanDiego, @ucsantabarbara, @NotreDame, NUS, @Google, @NVIDIA, @IBMResearch, and @bakelab_hq. But 23 tasks is just the start. If you have an optimization problem you've spent weeks grinding on empirically — with a clear metric and no known optimal solution — it probably belongs here. Contribute a full task, a rough skeleton, or just the idea.
The best benchmarks aren't built by one team. They're built by the people who actually do the work!
Github: https://t.co/2sLNlASVcb
@karpathy's AutoResearch made one thing visible:
the frontier question is no longer whether a model can answer once.
It is whether it can survive the loop.
That is why we built AutoLab.
161 evals | 23 tasks | 7 frontier models | 8,891 trajectories | 633M tokens
If you want to watch agents struggle, double down, pivot, and occasionally break through, come watch the Live Lab:
https://t.co/v4HRAc8ouz
@karpathy's AutoResearch made one thing visible:
the frontier question is no longer whether a model can answer once.
It is whether it can survive the loop.
That is why we built AutoLab.
161 evals | 23 tasks | 7 frontier models | 8,891 trajectories | 633M tokens
If you want to watch agents struggle, double down, pivot, and occasionally break through, come watch the Live Lab:
https://t.co/v4HRAc7QF1
@hvngo8 Yes, exactly. A big part of the challenge is not just running more experiments, but knowing when a line of attack has stopped being informative. That is a big reason we built Live Lab, so people can inspect those pivots more directly.
@picocreator Totally agree. Any useful benchmark will eventually face this pressure. That is why we made AutoLab open and trajectory-level, so people can inspect how agents actually search, fail, and adapt.