Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working.
Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵
Over the past 6 months creatives learned how to code; now it’s time for engineers to learn how to tell stories.
You must go direct and be interesting. Great trailblazing work from @a16z
The AI community seems to increasingly be heading towards a polarized world when discussing safety and consolidated power. I see this discourse as a false dichotomy, so @profjoeyg and I wrote an essay on how we need to change the conversation (link below).
5/5 No heavy spoilers here — the value is in the full framing and the engineering implications.
Would love your thoughts once you’ve read it.
👉 Read here: https://t.co/h66dQ1Bmvu
#AI #LLMs #Agents #MachineLearning
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use.
Its capabilities exceed those of any model we’ve ever made generally available.
CAIS AgenticSE Workshop Keynote Talk: Harbor Adapters & Harbor Index
30min video here:
https://t.co/RJvj5o7udy
Happy to be invited as a keynote speaker and present our recent study on Harbor Adapters and agentic evaluation!
I will be breaking my silence soon & going on TBPN to discuss the future of AI, autonomous defense, robotics, biotech, and energy infrastructure
@sama, let me know if you want to meet after. We are still in the early innings of American Dynamism. Much to discuss 🤝 🎥
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇
https://t.co/MSPMwnbhVt
@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.
1/6🧵
I’ve left Google DeepMind after an amazing chapter.
I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale.
As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals.
We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations.
https://t.co/F1lUWxDG2D
It’s sad to see so many talented people in Silicon Valley trapped in the rat race, chasing fame and money, living a life full of anxiety, and slowly losing empathy for people they now see as the “underclass”.
$20M is absolutely life-changing money. But if that’s the goal, I’m 10000% sure the crushing emptiness comes after that.
Happiness won’t come from that alone.
If you have family, take care of them today. If you have dreams, chase them now. Touch grass. Pay attention at dinner. Get good sleep. Don’t defer your life to “once I have $20M, everything will be fixed.”
“I think everybody should get rich and famous so they can see that it’s not the answer.” — Jim Carrey, the legendary actor from The Truman Show who suffers from depression.
Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest.
112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages
https://t.co/aVqCc4J5tr
I did research with @andrew_li03 at Berkeley, and Andrew is one of the sharpest and most driven minds I know. Super bullish on @JudgmentLabs's vision that the next real unlock for agents is monitoring and learning from production data. Congrats on the launch!!
We’re launching @JudgmentLabs today and announcing $32M in funding.
As AI agents take on more of the work that creates economic value, they generate massive amounts of production data: the clearest record of how they behave with users, software, and the real world.
Judgment builds infrastructure for improving AI agents from production data.