Richard Zhuang @RichardZ412 - Twitter Profile

Pinned Tweet

4 months ago

Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵

11

182

21

93

46K

RichardZ412 retweeted

Ran Li

@ranli_thinker

3 days ago

Over the past 6 months creatives learned how to code; now it’s time for engineers to learn storytelling. You must go direct and be interesting. Great trailblazing work from @a16z

1

8

1

0

914

Richard Zhuang

@RichardZ412

3 days ago

@ranli_thinker @a16z Curious which one is harder in your opinion!

1

0

106

RichardZ412 retweeted

Parth Asawa

@pgasawa

7 days ago

The AI community seems to increasingly be heading towards a polarized world when discussing safety and consolidated power. I see this discourse as a false dichotomy, so @profjoeyg and I wrote an essay on how we need to change the conversation (link below). https://t.co/A4WeDKzu5j

13

131

32

55

76K

RichardZ412 retweeted

Ruslan Belkin

@ruslansv

13 days ago

5/5 No heavy spoilers here — the value is in the full framing and the engineering implications. Would love your thoughts once you’ve read it. 👉 Read here: https://t.co/h66dQ1Bmvu #AI #LLMs #Agents #MachineLearning

0

2

1

203

RichardZ412 retweeted

Claude

@claudeai

13 days ago

Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Its capabilities exceed those of any model we’ve ever made generally available.

5K

105K

15K

22K

56M

RichardZ412 retweeted

Xiangjun Ethan Fu

@EthanFu0355525

14 days ago

Huge thanks to @thoma_gu and my fellow collaborators for the amazing support! Couldn’t have done this without you guys!! 🙌

0

4

1

0

410

Richard Zhuang

@RichardZ412

14 days ago

@kevin_x_li +1 this is my primary workflow now too!!

0

2

0

118

Richard Zhuang

@RichardZ412

15 days ago

@distributionat 61A can be hard yes but CS10 is literally at the level of “pre-intro to Python programming” though…

0

105

Richard Zhuang

@RichardZ412

20 days ago

@kevin_x_li GOATED

0

1

0

123

RichardZ412 retweeted

Matej Sirovatka

@m_sirovatka

23 days ago

online eval (noun): the act of DMing your coworker “new checkpoint is up” and waiting for them to offline eval it @mikasenghaas

1

23

1

0

2K

RichardZ412 retweeted

Lin Shi

@LinShi592021

24 days ago

CAIS AgenticSE Workshop Keynote Talk: Harbor Adapters & Harbor Index. 30-min video here: https://t.co/RJvj5o7udy Happy to be invited as a keynote speaker and present our recent study on Harbor Adapters and agentic evaluation!

0

25

6

11

21K

Richard Zhuang

@RichardZ412

26 days ago

Kyle Kuzma is now my favorite player of all time XD

kuz

@kylekuzma

27 days ago

I will be breaking my silence soon & going on TBPN to discuss the future of AI, autonomous defense, robotics, biotech, and energy infrastructure. @sama, let me know if you want to meet after. We are still in the early innings of American Dynamism. Much to discuss 🤝 🎥

315

4K

121

678

942K

0

4

0

236

RichardZ412 retweeted

Steven Dillmann

@StevenDillmann

about 1 month ago

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 https://t.co/MSPMwnbhVt @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

16

495

111

271

907K

Richard Zhuang

@RichardZ412

about 1 month ago

@ChengleiSi BREAKING: SGA got rejected by Anthropic

0

58

RichardZ412 retweeted

Lun Wang

@lunwang1996

about 1 month ago

I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://t.co/F1lUWxDG2D

56

2K

198

1K

617K

Richard Zhuang

@RichardZ412

about 1 month ago

@guohao_li As someone who has worked on model routing I genuinely hope it’s not the first option :)

0

2

0

120

RichardZ412 retweeted

Ran Li

@ranli_thinker

about 1 month ago

It’s sad to see so many talented people in Silicon Valley trapped in the rat race, chasing fame and money, living a life full of anxiety, and slowly losing empathy for people they now see as the “underclass”. $20M is absolutely life-changing money. But if that’s the goal, I’m 10000% sure the crushing emptiness comes after that. Happiness won’t come from that alone. If you have family, take care of them today. If you have dreams, chase them now. Touch grass. Pay attention at dinner. Get good sleep. Don’t defer your life to “once I have $20M, everything will be fixed.” “I think everybody should get rich and famous so they can see that it’s not the answer.” - Jim Carrey, the legendary actor from The Truman Show who suffers from depression.

0

3

1

491

Richard Zhuang

@RichardZ412

about 1 month ago

@kevin_x_li Lfg

0

2

0

78

RichardZ412 retweeted

Kevin Li

@kevin_x_li

about 1 month ago

Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages https://t.co/aVqCc4J5tr

19

526

67

397

80K

Richard Zhuang

@RichardZ412

about 1 month ago

I did research with @andrew_li03 at Berkeley, and Andrew is one of the sharpest and most driven minds I know. Super bullish on @JudgmentLabs's vision that the next real unlock for agents is monitoring and learning from production data. Congrats on the launch!!

Alex Shan

@alexshander03

about 1 month ago

We’re launching @JudgmentLabs today and announcing $32M in funding. As AI agents take on more of the work that creates economic value, they generate massive amounts of production data: the clearest record of how they behave with users, software, and the real world. Judgment builds infrastructure for improving AI agents from production data.

213

1K

153

367

4M

2

23

3

1

2K
