Zichen Chen (🐱,💖)

Verified account

@my_cat_can_code

Co-founder @ Stealth Startup 🍞| AI researcher @stanford @UCSB | All in ASI 📖 | Building for this universe 🌌 | ex-@googleresearch

Palo Alto, CA

Joined November 2022

471 Following

6.1K Followers

307 Posts

Pinned Tweet

Zichen Chen (🐱,💖)

@my_cat_can_code

8 months ago

I closed one chapter a few days ago: my last day at Google. Half a year of incredible research, surrounded by brilliant colleagues, an experience I’ll always treasure and recommend. Big Tech is safe, and safe is good. But when you’re young, too much safety means missing something vital. What’s missing? The courage to go all in. The thrill of building 0 → 1. So I packed my life, moved to San Francisco, and went all-in. I walked away from the safest paths, the big tech offer, the academic track, because if I never bet on myself, I’d regret it forever. And now -- right at a moment in history when AI can change everything -- who could resist betting it all? This is the next chapter: building the world’s greatest data infrastructure for ASI. This is bigger than me -- it’s a mission. If you’re curious, want to support, or just want to chat -- DM me. And if I can help you in any way, my DMs are always open. Let’s accelerate toward ASI together. 🚀 Fun fact: my last day wasn’t in South Bay, but in an SF office I’d never even been to before, because I was rushing to submit my ICLR paper🥲.

134

3K

76

665

448K

Zichen Chen (🐱,💖)

@my_cat_can_code

about 1 month ago

@Yihe__Deng Dear Yihe, do you want to talk? 🥰

1

2

0

0

1K

Zichen Chen (🐱,💖)

@my_cat_can_code

about 1 month ago

A lot of people have asked whether I’ll be at #ICLR 🇧🇷 this year. Sadly, I won’t make it in person. It has been an unusually busy stretch, and I ended up missing the trip. CoDA is being presented at #ICLR now, and you are welcome to stop by and chat ☕️. 📅 Sat (today!), Apr 25, 2026, 10:30 AM – 1:00 PM (local time) 🏠 Pavilion 3, P3-#1602 While CoDA is presented in the context of scientific visualization, the core architectural ideas go far beyond that application. What we really care about is a broader question: how agent systems can decompose complex tasks, collaborate across roles, and iteratively refine outputs until they become genuinely useful. I’m also excited that the code is now publicly available through Google Research: 👩‍💻 https://t.co/GMpTfhJpk9 If you are thinking about multi-agent systems, self-evolution, or harnesses, we would be very happy to discuss! 📝 Conf details: https://t.co/AfPkr7ySsK 📂 Project page: https://t.co/im1HznXbFc

Zichen Chen (🐱,💖)

@my_cat_can_code

8 months ago

With deep research revolutionizing research/data analysis, why are we still stuck in manually crafting data viz? Meet CoDA (https://t.co/u7BsevqvHs): The ultimate multi-agent LLM powerhouse for auto-generating stunning plots from NL queries! Handles complex data, self-refines for perfection, & smashes baselines by 41.5%🚀 Key Features: 🌟Specialized agents for metadata analysis, planning, code gen/debug, & reflection 🌟Bypasses LLM input length limits w/ metadata focus 🌟Iterative loops for robust, human-like quality checks 🌟SOTA on MatplotBench & Qwen & DA-Code #DataViz #AgenticAI #MultiAgent

8

72

9

28

19K

2

29

6

8

4K

Zichen Chen (🐱,💖)

@my_cat_can_code

about 2 months ago

Thank you for sharing our work! Exciting direction! What matters now is being able to measure whether agents are actually contributing to scientific and engineering progress, not just producing fluent outputs. If research-capable AI matters, evaluation has to be open, realistic, and community-built.

Charles Wu 吴英成AI🦞

about 2 months ago

Agent will take over human science!

0

10

1

1

2K

2

7

0

2

1K

Zichen Chen (🐱,💖)

@my_cat_can_code

about 2 months ago

@Fried_rice This is legendary; most systems are so fragile.

0

4

0

0

783

Zichen Chen (🐱,💖)

@my_cat_can_code

about 2 months ago

Since launching #AutoLab, we’ve gotten a lot of inbound from researchers, builders, and friends. What’s clear is this: the field wants a better standard for evaluating research-capable agents. Our goal is simple: build a fair, open, transparent benchmark for agents that can operate in real scientific and engineering loops. This should not be defined behind closed doors.

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

[Accidentally deleted this earlier, reposting] 😭 #AutoLab #autoresearch We've been asking ourselves a question: if AI agents can now run hundreds of experiments overnight, how do we know whether they're actually contributing to research — or just generating noise? That's why we built AutoLab (https://t.co/aRbV2YeaPf). Not another pass/fail benchmark, but an open-source environment where agents face the same loop every researcher knows intimately — propose, test, fail, diagnose, revise, repeat. 23 tasks with no answer keys, just open search spaces and real constraints. We ran 161 evaluations across 7 frontier models, 633M tokens. Every decision, every pivot, every dead end — all openly available in our Live Lab for anyone to replay and learn from. What we found wasn't about which model is "smartest." It's about a capability we call closed-loop resilience: when incremental refinement stops working, can the agent recognize it and restructure? On one task, two frontier models hit the same wall. One kept pushing within the existing frame. The other stepped back and redesigned the approach entirely. That moment — knowing when to abandon a frame, not just optimize within it — is what separates real research from sophisticated pattern matching. We believe this matters beyond benchmarking. If agents are genuinely entering the research loop, we want that transition to be measured transparently, built in the open, and shaped by the community — not locked inside any single lab. The scientist doesn't disappear. The loop gets a new participant. And we want to make sure that participant is understood. This is a joint effort across @Stanford, @MIT, @UW, @UCSanDiego, @ucsantabarbara, @NotreDame, NUS, @Google, @NVIDIA, @IBMResearch, and @bakelab_hq. But 23 tasks is just the start. If you have an optimization problem you've spent weeks grinding on empirically — with a clear metric and no known optimal solution — it probably belongs here. 
Contribute a full task, a rough skeleton, or just the idea. The best benchmarks aren't built by one team. They're built by the people who actually do the work! Github: https://t.co/2sLNlASVcb

2

78

14

63

10K

0

18

1

5

3K

Zichen Chen (🐱,💖)

@my_cat_can_code

about 2 months ago

The more concentrated frontier capability becomes, the more important open evaluation becomes. If powerful agentic systems are going to shape critical work, the field cannot rely only on selective access, internal safeguards, and closed reporting to understand what these systems can actually do. We need public evaluation surfaces with open tasks, replayable runs, visible failures, and community scrutiny. That is a big part of why we built AutoLab. https://t.co/wiLLUw3Ww4

about 2 months ago

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. https://t.co/NQ7IfEtYk7

2K

44K

7K

16K

31M

0

10

0

3

1K

my_cat_can_code retweeted

2 months ago

Awesome benchmark I've come across recently. It highlights how crucial the environment is, echoing my previous points about Interactive environments.

0

4

1

1

876

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

[Accidentally deleted this earlier, reposting] 😭 #AutoLab #autoresearch We've been asking ourselves a question: if AI agents can now run hundreds of experiments overnight, how do we know whether they're actually contributing to research — or just generating noise? That's why we built AutoLab (https://t.co/aRbV2YeaPf). Not another pass/fail benchmark, but an open-source environment where agents face the same loop every researcher knows intimately — propose, test, fail, diagnose, revise, repeat. 23 tasks with no answer keys, just open search spaces and real constraints. We ran 161 evaluations across 7 frontier models, 633M tokens. Every decision, every pivot, every dead end — all openly available in our Live Lab for anyone to replay and learn from. What we found wasn't about which model is "smartest." It's about a capability we call closed-loop resilience: when incremental refinement stops working, can the agent recognize it and restructure? On one task, two frontier models hit the same wall. One kept pushing within the existing frame. The other stepped back and redesigned the approach entirely. That moment — knowing when to abandon a frame, not just optimize within it — is what separates real research from sophisticated pattern matching. We believe this matters beyond benchmarking. If agents are genuinely entering the research loop, we want that transition to be measured transparently, built in the open, and shaped by the community — not locked inside any single lab. The scientist doesn't disappear. The loop gets a new participant. And we want to make sure that participant is understood. This is a joint effort across @Stanford, @MIT, @UW, @UCSanDiego, @ucsantabarbara, @NotreDame, NUS, @Google, @NVIDIA, @IBMResearch, and @bakelab_hq. But 23 tasks is just the start. If you have an optimization problem you've spent weeks grinding on empirically — with a clear metric and no known optimal solution — it probably belongs here. 
Contribute a full task, a rough skeleton, or just the idea. The best benchmarks aren't built by one team. They're built by the people who actually do the work! Github: https://t.co/2sLNlASVcb

2

78

14

63

10K

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@karpathy's AutoResearch made one thing visible: the frontier question is no longer whether a model can answer once. It is whether it can survive the loop. That is why we built AutoLab. 161 evals | 23 tasks | 7 frontier models | 8,891 trajectories | 633M tokens If you want to watch agents struggle, double down, pivot, and occasionally break through, come watch the Live Lab: https://t.co/v4HRAc8ouz

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

https://t.co/I0dih7bIIq

2

43

7

47

5K

0

15

3

3

996

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@karpathy's AutoResearch made one thing visible: the frontier question is no longer whether a model can answer once. It is whether it can survive the loop. That is why we built AutoLab. 161 evals | 23 tasks | 7 frontier models | 8,891 trajectories | 633M tokens If you want to watch agents struggle, double down, pivot, and occasionally break through, come watch the Live Lab: https://t.co/v4HRAc7QF1

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

https://t.co/I0dih7bIIq

2

43

7

47

5K

0

8

1

7

875

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

https://t.co/I0dih7bIIq

2

43

7

47

5K

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@hvngo8 Yes, exactly. A big part of the challenge is not just running more experiments, but knowing when a line of attack has stopped being informative. That is a big reason we built Live Lab, so people can inspect those pivots more directly.

0

0

0

0

27

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@picocreator Totally agree. Any useful benchmark will eventually face this pressure. That is why we made AutoLab open and trajectory-level, so people can inspect how agents actually search, fail, and adapt.

0

0

0

0

28

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@lorepunk Would love to hear what you find when you try it. That kind of feedback is exactly what we want AutoLab to grow around.

0

0

0

0

12

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@MichelIvan92347 Thank you Michel. Really appreciate it 🙌

0

1

0

0

12

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@deepakdk3478 Exactly. Pass/fail misses a lot. The quality of the loop, diagnosis, and iteration matters too.

0

1

0

0

21

Zichen Chen (🐱,💖)

@my_cat_can_code

2 months ago

@seanwbren Really appreciate the mention, Sean. Love this direction. We built AutoLab exactly for evaluating agents in real iterative loops.

0

1

0

0

22
