Kelly Buchanan @ekellbuch - Twitter Profile

Pinned Tweet

28 days ago

Very excited to release Terminal-Bench 2.1!

Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more.

We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark!

But the rankings survived, absolute scores moved up to 12pp!

28

762

74

219

85K

ekellbuch retweeted

Goodfire

@GoodfireAI

28 days ago

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵

307

11K

2K

9K

3M

ekellbuch retweeted

Saining Xie

@sainingxie

1 day ago

how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!

7

275

25

111

28K

ekellbuch retweeted

Gabe Pereyra

@gabepereyra

1 day ago

Efficient verification is what makes scaling legal agents practical. Excited to partner with @hwchase17 and the @LangChain Labs team on designing efficient verifiers - sharing early results showing open models can match frontier verifiers at a fraction of the cost on Legal Agent Bench.

2

51

8

48

10K

ekellbuch retweeted

Muyu He

@HeMuyu0327

2 days ago

Some of the more puzzling unpublished observations from our paper: deep attention layers hate the residual stream for V and love it for QK, but if the model has to make a choice, it will satisfy V over QK.

Translated into a finding: if we learn coefficients for the residual stream xi and the initial token embedding x0 as two input streams to deep attention layers, the model will give the coefficient for x0 a much larger magnitude. This means dominating the input with context-free token information.

However, if we learn the coefficients for both at a more fine-grained level for Q, K, and V, the coefficients for x0 are near 0 for both Q and K, but huge for V.

This reveals two surprises. (1) QK needs context information and little original token information. And K does not need the same information as V does (despite some models tying them). (2) Between the two opposite needs, the model is clearly in favor of what benefits V, so V is deemed more important to the optimization goal.

These are just the tip of the iceberg, and transformers surely move in mysterious ways. We will therefore embark on the second part of this journey and, for our next set of experiments, involve this lady (iykyk)...

Paper: https://t.co/oFaXEx9Upx
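
A minimal sketch of the setup described in the tweet above, assuming each of the Q, K, and V projections gets its own pair of learned scalar coefficients over the residual stream x_i and the initial token embedding x_0. The names `StreamMix` and `attention_inputs` and every coefficient value are hypothetical illustrations, not taken from the paper; the real parameterization may differ (for example, per-channel rather than scalar coefficients).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StreamMix:
    """Hypothetical pair of learned scalars weighting the two input streams."""
    a_residual: float  # coefficient on the residual stream x_i
    a_embed: float     # coefficient on the initial token embedding x_0

    def __call__(self, x_i: List[float], x_0: List[float]) -> List[float]:
        # Elementwise weighted sum of the two streams.
        return [self.a_residual * r + self.a_embed * e for r, e in zip(x_i, x_0)]


def attention_inputs(
    x_i: List[float],
    x_0: List[float],
    mix_q: StreamMix,
    mix_k: StreamMix,
    mix_v: StreamMix,
) -> Tuple[List[float], List[float], List[float]]:
    """Return the separately mixed inputs that would feed the Q, K, and V projections."""
    return mix_q(x_i, x_0), mix_k(x_i, x_0), mix_v(x_i, x_0)


# Toy vectors standing in for one token's residual state and its initial embedding.
x_i = [1.0, 2.0]
x_0 = [10.0, 20.0]

# Illustrative coefficients only (assumed, not reported values):
# Q and K ignore x_0, while V weights x_0 heavily.
q_in, k_in, v_in = attention_inputs(
    x_i,
    x_0,
    mix_q=StreamMix(a_residual=1.0, a_embed=0.0),
    mix_k=StreamMix(a_residual=1.0, a_embed=0.0),
    mix_v=StreamMix(a_residual=0.5, a_embed=2.0),
)
```

With these assumed values, the Q and K inputs are just the residual stream, while the V input is dominated by the context-free embedding, which mirrors the qualitative pattern the tweet reports.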

3

155

9

130

11K

ekellbuch retweeted

MiniMax (official) @MiniMax_AI

3 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

529

8K

1K

3K

3M

ekellbuch retweeted

Vals AI

@ValsAI

9 days ago

Pitch us a benchmark or eval technique. We'll fund you to build it.

We're opening applications for the Vals Fellowship. 3–6 months working on the hardest open problems in AI evaluation, with the resources to actually solve them.

What you get:
- Unlimited API credits + budget capacity for GPUs and human data
- Vals’ evaluation infrastructure
- $1,000–2,500 / week stipend
- A network of evals researchers across frontier labs and academia

Location: Both remote / in-person in SF applications will be considered

22

511

38

860

96K

ekellbuch retweeted

Cartesia

@cartesia

7 days ago

Cartesia Ink-2 debuts as #1 for accuracy on the brand-new streaming speech-to-text leaderboard from @ArtificialAnlys! We designed Ink-2 from the ground up for voice agents - with low latency, eager transcripts, and semantic endpointing.

6

120

36

49

60K

ekellbuch retweeted

Jon Saad-Falcon

@JonSaadFalcon

7 days ago

The dominant story in AI has been the growing cloud: bigger clusters, larger models, more gigawatts. We believe the future is in the opposite direction: on-device inference, smaller models, watts instead of gigawatts. Today we're releasing @OpenJarvisAI v1.0: a personal AI assistant that lives, learns, and works on your device.

49

596

91

566

144K

ekellbuch retweeted

Cognition @cognition

8 days ago

1/ We’ve raised over $1B at a $26B valuation, led by @Lux_Capital, @generalcatalyst, and @8vc.

Our enterprise usage has grown >10x since the start of this year, and our run-rate revenue grew to $492M.

We launched Devin two years ago as the first AI software engineer. Since then, cloud agents have gone from niche to mainstream, and today they are the fastest growing way to create software.

165

2K

200

463

855K

ekellbuch retweeted

hardmaru

@hardmaru

23 days ago

The agent also reimplemented the “Blues Improvisation” experiment by @douglas_eck and @SchmidhuberAI from 2002, which showed that LSTMs can learn temporal structure in music.

Finding temporal structure in music: Blues improvisation with LSTM recurrent networks https://t.co/2oU4iIcJEr

6

70

12

23

10K

ekellbuch retweeted

Binfeng Xu

@billxbf

9 days ago

Excited to release 🌟Polar🌟, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Hermes, or your self-made ones 🔥 -- Polar takes your harnesses directly as training environments without code change.

Find a problem, design the harness, and train your own agents! 🧵

25

896

144

943

129K

ekellbuch retweeted

Applied Compute @appliedcompute

12 days ago

Some enterprise tasks are challenging to hill-climb with RL-based methods since they involve very out-of-distribution behavior. On-policy self-distillation (OPSD) gives a model learning signal for every token it writes, far richer than the single scalar reward of RL. But that channel is noisy: most tokens don't reflect the behavior you're after. We introduce Relevance-Masked Self-Distillation (RMSD), which uses a two-step filtered loss mask to cut through the noise and find the tokens with the highest signal. Compared to OPSD it trains more stably, provides higher data efficiency, and reaches a higher performance ceiling.

9

296

27

286

85K

ekellbuch retweeted

Rajat V D @rajat_vd

17 days ago

I defended my PhD last week! The main focus of the talk was my recent work on fast sketching on GPUs via co-design, FlashSketch (https://t.co/7wXxsTFGsn, ICML 2026 Spotlight). Recording: https://t.co/mMwy3vZFO7 Step through the animated slides: https://t.co/M9jYSlfqty

2

159

16

113

16K

Kelly Buchanan

@ekellbuch

13 days ago

Amazing to see Terminal Bench 2.1 at the top of the leaderboard, and congratulations on the results, Google!

Jeff Dean

@JeffDean

16 days ago

1/ Today at #GoogleIO, we’re releasing Gemini 3.5, our latest family of models combining frontier intelligence with action.

We’re starting by releasing 3.5 Flash, which is built to help you execute complex, long-horizon agentic workflows.

Gemini 3.5 Flash is our strongest model for coding and agent https://t.co/m62cBJhIjJ outscores 3.1 Pro on agentic and coding benchmarks like Terminal-Bench and MCP Atlas, while running 4x faster than other frontier models.

Used in Google Antigravity, 3.5 Flash is even further optimized to be up to 12x faster. It’s a powerful engine to deploy sub-agents that collaborate, run high-frequency iterative loops, and solve real-world problems at scale.

Some highlights we’re excited about 🔽

83

1K

196

230

132K

0

47

1

2

3K

ekellbuch retweeted

Demis Hassabis

@demishassabis

15 days ago

Gemini 3.5 Flash is amazing!

- Performs better than 3.1 Pro on coding & agentic tasks
- 4x faster than other frontier models
- 12x faster in @antigravity - 800 tokens/sec!
- Often at less than half the cost

And Pro to come…

Try it in @antigravity, @GeminiApp & more - enjoy! https://t.co/ujGtiDBfSL

314

3K

260

254

256K

ekellbuch retweeted

Mehrdad Farajtabar @MFarajtabar

23 days ago

🧵 1/11 Everyone's doing on-policy distillation now (Qwen3, Deepseek V4, GLM-5).

But here's what nobody's asking: at any given token or for a question and a teacher, when does the teacher's guidance actually help, and when does it quietly make things worse?

We found a way to answer this. No training needed!

4

434

51

512

29K

ekellbuch retweeted

Oliver

@olvrgln

about 1 month ago

Introducing Mesa: the most powerful filesystem ever built, designed specifically for enterprise AI agents.

Every team building agents eventually hits the same wall: where do the files live?

Not the chat history, the actual artifacts the agent works on.
> The contracts your agent redlined
> The claim files it updated
> The 200-page audit report it edited overnight while you were asleep

Today those documents live in a sandbox that dies in 30 minutes, an S3 bucket where concurrent writes clobber each other, or a GitHub repo that was never built to absorb agent-scale traffic.

So we built Mesa.

The world's first POSIX-compatible filesystem with built-in version control, designed from the ground up for agents. You mount it into your sandbox like any other filesystem. Your agent reads and writes files normally. Behind the scenes every change is versioned, branchable, reviewable, and rollback-able, like a codebase, for any file type.

Mesa provides
– Branches so agents work in parallel without locking
– Durable storage that survives sandbox death
– Sparse materialization so massive document sets load instantly
– Fine-grained access control per agent
– Full history for human review and audit

Design partners are running Mesa in production across legal, healthcare, GTM, business ops, and coding agents.

Private beta is open: link in the comments

123

2K

159

4K

640K

ekellbuch retweeted

Exa

@ExaAILabs

15 days ago

We raised $250M in Series C funding at a $2.2B valuation, led by a16z. Exa is a search lab organizing the web's data for agents.

157

2K

169

815

1M

ekellbuch retweeted

Maksym Andriushchenko

@maksym_andr

15 days ago

💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.

AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters.

Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.

Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).

One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now.

This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!

12

346

48

212

42K

ekellbuch retweeted

Modal @modal

15 days ago

Frontier models set the floor. Specialized models raise the ceiling. With Modal, @AppliedCompute is training custom agent workforces for companies like DoorDash, Mercor, and Cognition.

5

127

18

40

37K
