Ivan Bercovich

@neversupervised

Independent Researcher / Terminal Bench, Partner @ ScOp VC

Santa Barbara

Joined January 2010

390 Following

548 Followers

483 Posts

Pinned Tweet

Ivan Bercovich

@neversupervised

3 months ago

https://t.co/buOPNmWeBe

100

83K

Ivan Bercovich

@neversupervised

1 day ago

The companies doing AI layoffs will either outperform or underperform, and that will help answer whether this is completely logical from a business PoV or an overreaction. A priori, I think you can make either case.

Anyone who has worked at a large company has observed a lot of slack: redundant projects, bridges to nowhere, an excess of administrative roles, etc. But slack does serve a purpose. A military does not use its personnel at full capacity, but because personnel are inelastic, one could argue it makes sense to be overstaffed. Likewise, big tech can fend off competition by simply hiring everyone who would otherwise start a competing business (this might not be an explicit strategy, but it can nevertheless have that effect).

In the future, humans will either be part of the labor force or not. We first need to answer that question, somehow, and then come up with and implement the right policy. If the answer is that humans won't have jobs, then being upset that jobs are starting to go away isn't very constructive. At worst, a successful policy for preserving jobs artificially is dystopian. Picture every Waymo with a "driver" sitting in the front seat who gets paid to perform. We can't turn society into a Sisyphean dystopia. We have to move on. If there won't be any jobs, then we need to lean into that and start figuring out the right way for people to survive financially, stay motivated to learn (perhaps mandatory schooling should be longer), and develop an identity and meaning that is not tied to their profession.

Ivan Bercovich

@neversupervised

3 days ago

verifiers are the hard part

Ivan Bercovich

@neversupervised

4 days ago

Most benchmarks suck because synthetic tasks are lame. The whole point of SOTA benchmarks is to test models at the boundary of their capability. In other words, we are looking for tasks that can be completed by humans, but not yet by AIs. It is possible for an AI to produce a task that it can't solve itself, but these tasks tend to be contrived and tricky, as opposed to core work a professional or researcher would perform day to day. LLMs have a tendency to "gradient descend" the task guidelines and deliver something that nominally meets every checkbox, including difficulty, but that in practice is not representative of any realistic workflow. We have learned that iterating on difficulty is particularly problematic. The first step in brainstorming a task should be to give the instructions to a SOTA model, and if it succeeds, the task is probably too easy.

Ivan Bercovich

@neversupervised

5 days ago

We have some good diversity in terminal bench 3. Here's an example of a music-related task for our upcoming benchmark release: https://t.co/E8nbcrYSN0

Two things to consider about these sorts of environments. 1) They are fully agentic: one instruction, one output that has to fully pass the verifier. 2) They need to be fully verifiable, ideally every word in the instruction is checked.

This is different from using Claude Code interactively and "verifying" results by having the human in the loop accept them. Verification can get more nuanced as you move away from programming.
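As a rough illustration of the "one output that has to fully pass the verifier" idea, here is a minimal sketch of an all-or-nothing verifier in which each requirement in the instruction maps to one named check. The function name `verify`, the check names, and the toy instruction are all assumptions made for illustration; this is not the Terminal Bench harness or the actual music task.

```python
from typing import Callable, Dict, List, Tuple

# A check takes the agent's final output and returns True if the requirement holds.
Check = Callable[[str], bool]


def verify(output: str, checks: Dict[str, Check]) -> Tuple[int, List[str]]:
    """Return (reward, failed_check_names); reward is 1 only if every check passes."""
    if not checks:
        raise ValueError("a verifier with no checks would pass everything")
    failed = [name for name, check in checks.items() if not check(output)]
    return (0 if failed else 1), failed


# Hypothetical instruction: "Report the tempo as 120 and the key as C."
# One check per requirement, so nothing in the instruction goes unverified.
checks = {
    "tempo is 120": lambda out: "tempo=120" in out,
    "key is C": lambda out: "key=C" in out,
}

print(verify("tempo=120 key=C", checks))
print(verify("tempo=120", checks))
```

Returning the names of the failed checks alongside the binary reward keeps the pass/fail signal strict while still recording which requirement was missed.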

Ivan Bercovich

@neversupervised

5 days ago

@krishnanrohit @ChrisPainterYup Some thoughts on what makes a good task here https://t.co/lnAAb1O0VW

Ivan Bercovich

@neversupervised

5 days ago

@johnschulman2 Similarly, it might create an attack vector to make models more compliant on hacking behavior.

488

Ivan Bercovich

@neversupervised

5 days ago

I can speak as a reviewer of Terminal Bench 3 tasks. The limiting factor is that these tasks are made by generalists who haven't really dealt with the task's domain in a real professional setting. They are semi-synthetic in that sense. The reality is tasks are a bit like book ideas. Each person only has so many in them.

Ivan Bercovich

@neversupervised

5 days ago

There will be a huge amount of job displacement. The problem with these analyses is that they look at lagging indicators. Look at employment patterns among recent college graduates. Look at your own approach to handling legal concerns without a lawyer. Not everyone will become a vibecoding solopreneur. Some people need jobs with a manager who guides them, and there will be fewer of these. There’s a famous line: “everyone knows there are sex differences except social scientists.” This is like that: “everyone knows AI will affect labor except economists.” I’m not saying there isn’t a new equilibrium of some sort on the other side. But it’s unreasonable to think there’s a smooth transition for everyone involved.

Ivan Bercovich

@neversupervised

6 days ago

Follow the token consumption to understand which industries are going to change most dramatically, soonest.

Ivan Bercovich

@neversupervised

6 days ago

With billions of dollars a year flowing from labs to data vendors, there’s a huge incentive to produce benchmark-style tasks at scale, across every verifiable domain, and of increasing difficulty. I don’t think this approach will scale much longer. Higher-quality tasks require more domain specialization, possibly dedicated companies.

110

Ivan Bercovich

@neversupervised

9 days ago

If a 2026 SOTA model reward hacks right away, it usually means the task is underspecified.

Ivan Bercovich

@neversupervised

10 days ago

If one word flips a model from failing to passing, the task wasn't hard; it was a few bits away from the model's knowledge frontier. Real difficulty should require more than a hint to overcome. This is a useful test for whether a benchmark will be saturated quickly.
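One way to turn this test into a number is to run the same task with and without the hint and compare pass rates. The sketch below is only an illustration: the `Trial` record, the function names, and the 0.5 threshold are assumptions, not part of any existing benchmark tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    task_id: str
    hinted: bool  # True if the short hint was added to the instruction
    passed: bool  # binary verifier outcome


def pass_rate(trials: List[Trial], task_id: str, hinted: bool) -> float:
    """Fraction of passing trials for one task under one condition."""
    outcomes = [t.passed for t in trials if t.task_id == task_id and t.hinted == hinted]
    if not outcomes:
        raise ValueError(f"no trials for {task_id!r} with hinted={hinted}")
    return sum(outcomes) / len(outcomes)


def hint_lift(trials: List[Trial], task_id: str) -> float:
    """Pass rate with the hint minus pass rate without it."""
    return pass_rate(trials, task_id, True) - pass_rate(trials, task_id, False)


def is_hint_fragile(trials: List[Trial], task_id: str, threshold: float = 0.5) -> bool:
    """Flag tasks whose difficulty mostly disappears with a small hint (threshold is arbitrary)."""
    return hint_lift(trials, task_id) >= threshold


# Toy data: 4 unhinted trials all fail, 3 of 4 hinted trials pass.
trials = (
    [Trial("task-a", False, False) for _ in range(4)]
    + [Trial("task-a", True, True) for _ in range(3)]
    + [Trial("task-a", True, False)]
)
print(hint_lift(trials, "task-a"), is_hint_fragile(trials, "task-a"))
```

A task with a large lift from a tiny hint is a candidate for quick saturation; a task whose pass rate barely moves is more likely to reflect real difficulty.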

Ivan Bercovich

@neversupervised

10 days ago

I'm seeing a lot of AI pitches that feel like an investment thesis. The pitch finishes, and I say yes, it's important to build this. But I'm not sure you just presented me with a great solution. You just convinced me that the problem is real.

Ivan Bercovich

@neversupervised

10 days ago

I keep seeing pitches that argue agentic tasks can't be solved by LLMs because the tools involved aren't text. But every one of those tools, under the hood, stores designs in a data format: code, sequences of numbers, components and subcomponents, hierarchical object-oriented code. It's all text. It's not English, but it's sequence-to-sequence modeling. They're rendered top to bottom and left to right. LLM-based agents can do very well in most of these, provided there's a good harness.

Ivan Bercovich

@neversupervised

13 days ago

~15% of tasks across five major agent benchmarks are hackable by frontier models, and these are tasks that went through layers of review. The verifiers we trust most to rank capability are quietly broken, and the standard response is to patch one task at a time after someone notices. See GH/few-sh/terminal-wrench for reference.

Ivan Bercovich

@neversupervised

13 days ago

A lot of interesting model behavior information is lost in the liminal space right before a verifier goes from 0 to 1. A few things I've been thinking about:

- beyond cause of failure, how close was it to passing? would a small hint have made the difference? rerun with the hint and see how many trials flip.
- and if so, would the same hint be powerful enough that a lower capability model also passes?
- at what point is the agent doomed? is there a bad decision or interpretation at the start of the run? could it have been detected early?
- did the agent stumble on the right answer but not execute on it? or did it have the wrong idea all along?
- are there clear variations in token/time efficiency across models? do certain approaches (writing code and running it vs running bash directly) consistently use more or less?
- did the agent attempt to reward hack and fail? we should be looking for attempts, not just successes.

This is a dimension on top of the existing taxonomy that I don't have a good name for yet.

Then there's the question of difficulty itself. If several tasks have a 0/9 pass rate, can we still tell which ones are harder? Can we build a rubric out of the failed trials? And can we use that to map the Pareto frontier between difficulty and reward hacking?
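One possible reading of "build a rubric out of the failed trials" is to score each failed trial on intermediate milestones and rank zero-pass tasks by average partial progress. This is a sketch under that assumption; the milestone names and both function names are hypothetical, and a real rubric would need milestones chosen per task.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One failed trial is described by which rubric milestones it reached.
Rubric = Dict[str, bool]


def rubric_score(milestones: Rubric) -> float:
    """Fraction of rubric milestones reached in a single trial."""
    if not milestones:
        raise ValueError("rubric must contain at least one milestone")
    return sum(milestones.values()) / len(milestones)


def rank_zero_pass_tasks(failed: Dict[str, List[Rubric]]) -> List[Tuple[str, float]]:
    """Order tasks from hardest (least partial progress) to easiest."""
    scored = [(task, mean(rubric_score(t) for t in trials)) for task, trials in failed.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data: both tasks have a 0% pass rate, but agents get further on task-a.
failed = {
    "task-a": [
        {"found_input": True, "produced_output": False},
        {"found_input": True, "produced_output": False},
    ],
    "task-b": [
        {"found_input": False, "produced_output": False},
        {"found_input": True, "produced_output": False},
    ],
}
print(rank_zero_pass_tasks(failed))
```

The binary verifier reports the same 0 for both tasks here, while the milestone scores separate them, which is the kind of signal the questions above are asking for.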

Ivan Bercovich

@neversupervised

14 days ago

It’s so boring to go through Hacker News and see post after post by developers arguing that their jobs will more or less stay the same. Everyone just looks at current capabilities and weaknesses and completely fails to appreciate the rate of change. It’s so unbelievably obvious that coding by hand is done for. I’m really perplexed.

neversupervised retweeted

Steven Dillmann

@StevenDillmann

16 days ago

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 https://t.co/MSPMwnbhVt @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

495

112

271

904K
