The companies doing AI layoffs will either outperform or underperform, and that will help answer whether this is completely logical from a business PoV or an overreaction. A priori, I think you can make either case. Anyone who has worked at a large company has observed a lot of slack, redundant projects, bridges to nowhere, an excess of administrative roles, etc. But slack does serve a purpose: a military is not using its personnel at full capacity, but due to personnel inelasticity, one could argue it makes sense to be overstaffed. Likewise, big tech can fend off competition by simply hiring everyone who would otherwise start a competing business (this might not be an explicit strategy, but it nevertheless can have that effect).
In the future, humans will either be part of the labor force or not. We first need to answer that question, somehow, and then come up with and implement the right policy. If the answer is that humans won't have jobs, then being upset that jobs are starting to go away isn't very constructive. At worst, a successful policy for preserving jobs artificially is dystopian. Imagine every Waymo with a "driver" sitting in the front seat, paid to perform. We can't turn society into a Sisyphean dystopia. We have to move on. If there won't be any jobs, then we need to lean into that and start figuring out the right way for people to survive financially, be motivated to learn (perhaps mandatory schooling should be longer), and develop an identity and meaning that is not related to their profession.
Most benchmarks suck because synthetic tasks are lame. The whole point of SOTA benchmarks is to test models at the boundary of their capability. In other words, we are looking for tasks that can be completed by humans, but not yet by AIs. It is possible for an AI to produce a task that it can't solve itself, but these tasks tend to be contrived and tricky, as opposed to core work a professional or researcher would perform day to day. LLMs have a tendency to "gradient descend" the task guidelines and deliver something that nominally meets every checkbox, including difficulty, but that in practice is not representative of any realistic workflow. We have learned that iterating on difficulty is particularly problematic. The first step in brainstorming a task should be to give the instructions to a SOTA model, and if it succeeds, the task is probably too easy.
We have some good diversity in Terminal Bench 3. Here's an example of a music-related task for our upcoming benchmark release: https://t.co/E8nbcrYSN0
Two things to consider about these sorts of environments. 1) They are fully agentic: one instruction, one output that has to fully pass the verifier. 2) They need to be fully verifiable; ideally, every word in the instruction is checked.
This is different from using Claude Code interactively and "verifying" results by having the human in the loop accept them. Verification can get more nuanced as you move away from programming.
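To make the "one instruction, one output, fully pass the verifier" idea concrete, here's a minimal sketch. The instruction, the check names, and the `verify` function are all made up for illustration; this is not how any particular benchmark implements its verifiers. The point is just that each clause of the instruction maps to a check, and the reward is 1 only if every check passes.

```python
from typing import Callable, Dict

Check = Callable[[str], bool]


def verify(output: str, checks: Dict[str, Check]) -> int:
    """Return 1 only if every check passes, otherwise 0 (no partial credit)."""
    return int(all(check(output) for check in checks.values()))


# Hypothetical instruction:
# "Write a CSV with the header row 'name,bpm' and exactly three data rows."
checks: Dict[str, Check] = {
    # One check per clause of the instruction.
    "has_header": lambda out: out.splitlines()[:1] == ["name,bpm"],
    "three_data_rows": lambda out: len(out.strip().splitlines()) == 4,
}

good = "name,bpm\nintro,90\nverse,120\nchorus,128\n"
bad = "name,bpm\nintro,90\n"  # correct header, too few rows

good_reward = verify(good, checks)
bad_reward = verify(bad, checks)
```

With no human in the loop, an output that is almost right (like `bad` above) gets the same reward as one that is completely wrong.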
I can speak as a reviewer of Terminal Bench 3 tasks. The limiting factor is that these tasks are made by generalists who haven't really dealt with the task's domain in a real professional setting. They are semi-synthetic in that sense. The reality is tasks are a bit like book ideas. Each person only has so many in them.
There will be a huge amount of job displacement. The problem with these analyses is that they look at lagging indicators. Look at employment patterns among recent college graduates. Look at your own approach to handling legal concerns without a lawyer. Not everyone will become a vibecoding solopreneur. Some people need jobs with a manager who guides them, and there will be fewer of these. There’s a famous line: “everyone knows there are sex differences except social scientists.” This is like that: “everyone knows AI will affect labor except economists.” I’m not saying there isn’t a new equilibrium of some sort on the other side. But it’s unreasonable to think there’s a smooth transition for everyone involved.
With billions of dollars a year flowing from labs to data vendors, there’s a huge incentive to produce benchmark-style tasks at scale, across every verifiable domain, and of increasing difficulty. I don’t think this approach will scale much longer. Higher-quality tasks require more domain specialization, possibly dedicated companies.
If one word flips a model from failing to passing, the task wasn't hard; it was a few bits away from the model's knowledge frontier. Real difficulty should require more than a hint to overcome. This is a useful test for whether a benchmark will be saturated quickly.
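A rough sketch of how the hint test could be scored, assuming you rerun the same trials with a short hint added to the instruction. The `flip_rate` helper and the numbers below are hypothetical, not results from a real benchmark.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[bool], hinted: Sequence[bool]) -> float:
    """Fraction of baseline failures that pass once the hint is added."""
    if len(baseline) != len(hinted):
        raise ValueError("baseline and hinted runs must be paired trial by trial")
    # Keep only the hinted outcomes for trials that failed without the hint.
    hinted_after_failure = [h for b, h in zip(baseline, hinted) if not b]
    if not hinted_after_failure:
        return 0.0  # nothing failed, so nothing could flip
    return sum(hinted_after_failure) / len(hinted_after_failure)


# Made-up example: 0/9 without the hint, 7/9 with a one-word hint.
baseline = [False] * 9
hinted = [True, True, False, True, True, True, False, True, True]

rate = flip_rate(baseline, hinted)
```

A flip rate close to 1 from a tiny hint suggests the task is sitting right at the frontier rather than beyond it.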
I'm seeing a lot of AI pitches that feel like an investment thesis. The pitch finishes, and I say yes, it's important to build this. But I'm not sure you just presented me with a great solution. You just convinced me that the problem is real.
I keep seeing pitches that argue agentic tasks can't be solved by LLMs because the tools involved aren't text. But every one of those tools, under the hood, stores designs in a data format. Code, sequences of numbers, components and subcomponents, hierarchical object-oriented code. It's all text. It's not English, but it's sequence-to-sequence modeling. They're rendered top to bottom and left to right. LLM-based agents can do very well in most of these, provided there's a good harness.
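A toy illustration of the "it's all text" point. The `Component` class and the tiny board design are invented for this example and don't correspond to any real tool's file format; the sketch only shows that a hierarchy of components and subcomponents round-trips through plain text.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Component:
    """Hypothetical node in a design hierarchy."""
    name: str
    kind: str
    children: List["Component"] = field(default_factory=list)


# A made-up design: one board with two parts on it.
design = Component(
    "board",
    "pcb",
    [Component("U1", "mcu"), Component("R1", "resistor")],
)

# Serialize the whole hierarchy to text, then read it back.
as_text = json.dumps(asdict(design), indent=2)
restored = json.loads(as_text)
```

Whatever the GUI looks like, an agent with a decent harness can read and write `as_text` directly.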
~15% of tasks across five major agent benchmarks are hackable by frontier models, and these are tasks that went through layers of review. The verifiers we trust most to rank capability are quietly broken, and the standard response is to patch one task at a time after someone notices. See GH/few-sh/terminal-wrench for reference.
A lot of interesting model behavior information is lost in the liminal space right before a verifier goes from 0 to 1. A few things I've been thinking about:
- beyond cause of failure, how close was it to passing? would a small hint have made the difference? rerun with the hint and see how many trials flip.
- and if so, would the same hint be powerful enough that a lower capability model also passes?
- at what point is the agent doomed? is there a bad decision or interpretation at the start of the run? could it have been detected early?
- did the agent stumble on the right answer but not execute on it? or did it have the wrong idea all along?
- are there clear variations in token/time efficiency across models? do certain approaches (writing code and running it vs running bash directly) consistently use more or less?
- did the agent attempt to reward hack and fail? we should be looking for attempts, not just successes.
This is a dimension on top of the existing taxonomy that I don't have a good name for yet.
Then there's the question of difficulty itself. If several tasks have a 0/9 pass rate, can we still tell which ones are harder? Can we build a rubric out of the failed trials? And can we use that to map the Pareto frontier between difficulty and reward hacking?
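One way this could look, as a sketch under assumptions: suppose each failed trial is graded against a short rubric of milestones (the milestones, task names, and data below are all hypothetical). Averaging the fraction of milestones reached gives a partial-progress score that can separate tasks that all sit at 0/9.

```python
from statistics import mean
from typing import Dict, List


def mean_progress(trials: List[List[bool]]) -> float:
    """Average fraction of rubric milestones reached across a task's trials."""
    return mean(sum(trial) / len(trial) for trial in trials)


# Made-up data: every trial failed the verifier, but reached different milestones.
# Each inner list is one trial; each entry is one rubric milestone.
failed_trials: Dict[str, List[List[bool]]] = {
    "task_a": [[True, True, False], [True, False, False]],
    "task_b": [[False, False, False], [True, False, False]],
}

# Lowest partial progress first, i.e. hardest task first.
ranking = sorted(failed_trials, key=lambda task: mean_progress(failed_trials[task]))
```

Here `task_b` ranks as harder than `task_a` even though both have the same pass rate. Whether a rubric like this tracks real difficulty is an open question, not something this sketch settles.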
It’s so boring to go through Hacker News and see post after post by developers arguing that their jobs will more or less stay the same. Everyone just looks at current capabilities and weaknesses and completely fails to appreciate the rate of change. It’s so unbelievably obvious that coding by hand is done for. I’m really perplexed.
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇
https://t.co/MSPMwnbhVt
@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.
1/6🧵