Ryan

Verified account

@_PaperMoose_

CTO @heynoah. Built ARC-AGI 2 evals @gregkamrad.

SF

Joined August 2017

1.6K Following

1.3K Followers

3.4K Posts

Pinned Tweet

8 months ago

When you deploy an LLM-as-a-Judge, you’re shipping a classifier into production. Each new version is a hypothesis about how the model interprets the world. It’s data science, just expressed in natural language.

Here’s what that looked like for a recent client project where we trained an evaluator to detect a specific agent error type (labeled Category 1 failures) before release.

Dataset

Dev: 104 labeled traces (46 failures, 58 clean)

Eval: 95 labeled traces (34 failures, 61 clean)

What We Saw

v1 established a clear baseline.

v2 drove recall higher but overfit to the dev set, collapsing generalization.

v3 made surgical adjustments that clarified “when not to trigger,” improving specificity and stability.

v10 is when we started to see a step change in eval set performance, a sign the judge was beginning to generalize.

Why It Matters

I find that teams often fall into the trap of assuming the LLM works without verifying it against hard data. This is a big mistake! Look at the numbers below and see for yourself. Even with careful preparation, the model still fails to correctly classify more than 80 percent of actual labeled errors.

A few percent of overfit recall here, a small generalization gap there, and suddenly your CI isn’t filtering what you think it is.

Treat judges like classifiers: versioned, measured, and tuned against held-out data.

That’s how you keep agents honest in production.

@HamelHusain @sh_reya
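
A minimal sketch of what "treat the judge like a classifier" can look like in code: score each judge version on the dev and held-out eval traces, then compare recall across the two. The per-version numbers from the tweet's attached image are not reproduced here, so the traces in the demo are made-up toy data. `score_judge` and `generalization_gap` are illustrative names assumed for this example, not the author's code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JudgeMetrics:
    recall: float       # share of labeled failures the judge flagged
    specificity: float  # share of clean traces the judge left alone
    precision: float    # share of flagged traces that were real failures


def score_judge(labels: Sequence[bool], predictions: Sequence[bool]) -> JudgeMetrics:
    """Score one judge version. True means 'Category 1 failure'."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l and p)
    fn = sum(1 for l, p in pairs if l and not p)
    fp = sum(1 for l, p in pairs if not l and p)
    tn = sum(1 for l, p in pairs if not l and not p)
    return JudgeMetrics(
        recall=tp / (tp + fn) if tp + fn else 0.0,
        specificity=tn / (tn + fp) if tn + fp else 0.0,
        precision=tp / (tp + fp) if tp + fp else 0.0,
    )


def generalization_gap(dev: JudgeMetrics, held_out: JudgeMetrics) -> float:
    """Dev recall minus held-out recall; a large positive gap suggests overfitting."""
    return dev.recall - held_out.recall


# Toy data only (hypothetical): 4 failures and 6 clean traces per split.
toy_labels = [True] * 4 + [False] * 6
toy_dev_preds = [True, True, True, False, True, False, False, False, False, False]
toy_eval_preds = [True, False, False, False] + [False] * 6

dev_metrics = score_judge(toy_labels, toy_dev_preds)
eval_metrics = score_judge(toy_labels, toy_eval_preds)
print(dev_metrics, eval_metrics, generalization_gap(dev_metrics, eval_metrics))
```

Running the same function on every prompt version, and storing the results next to the version tag, gives the kind of versioned, held-out comparison the post argues for.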

8

134

14

118

16K

10 days ago

@levelsio total psycho

0

1

0

0

97

10 days ago

@SahilBloom I can definitely relate to this. How did you do it?

0

0

0

0

44

10 days ago

@thorstenball Been living in coding agents all year. The thing that decides which ones stick for me isn't raw capability, it's how fast I can review what they produced. That's where my bottleneck moved. Curious how Neo handles the review surface.

0

2

0

0

462

22 days ago

reading user transcripts on a sunday is the closest thing founders have to therapy. not because the users say nice things. because you stop being defensive about your product and start hearing what they actually said. distance is a feature. plan for it.

0

0

0

0

42

23 days ago

ai chat ux failure mode i keep seeing: the agent has too much context and uses all of it. users don't want a system prompt rolled into the response. they want the answer. context should be invisible. the moment the agent says "based on what you told me earlier..." you've shown your hand and lost the magic.

0

1

0

0

47

24 days ago

startup math: 100% of 1 thing > 75% of 7 things. the second is dead in the water. the math feels wrong because adding features feels like progress. it's the opposite. each new feature dilutes the one thing you were almost great at.

0

0

0

0

29

24 days ago

the questions you ask politely get answered politely. the questions you ask after four hours of sitting in a room together get answered honestly. most product research stops at hour one. the good stuff starts at hour three. founders who skip the slog get the polite answer and miss the real one.

0

1

0

0

40

25 days ago

every great agent product i've used has the same property: i forget there's a model behind it. the moment you remember it's an llm, the product has lost. the work is making the seams invisible. most teams optimize for showing the seams off.

0

1

0

0

32

25 days ago

scheduling isn't the wedge into executive workflow because scheduling is hard. it's the wedge because the meeting that didn't happen is more expensive than the meeting that ran ten minutes long. founders building "calendly for X" miss this. it's not the meeting. it's the meeting NOT happening.

0

0

0

0

35

26 days ago

LLM costs are now the variable cost. infrastructure is the fixed cost. we've inverted the SaaS economics in two years and most of the playbooks haven't caught up. usage-based pricing isn't a billing model anymore. it's a margin survival strategy.

0

1

0

0

29

26 days ago

founders ship two products at the same time: the one the user pays for. the one they sell to themselves at night to keep going. if those drift apart for too long you quit. the work is keeping them aligned without lying about either.

0

0

0

0

14

26 days ago

calling it review surface area: how fast your team can comment on what claude code just made. markdown is bad at this. html with inline comments (like google docs) is great. specs, evals, prompts, feature proposals, code reviews. all rendered + commentable now. comments feed back into claude code as direct edits. ai for implementation. humans for judgment. that second one breaks the moment review gets painful.

1

0

0

0

43

27 days ago

the eval suite that matters is the one built from user complaints, not the one designed upfront. upfront evals test what you imagined the product would do. complaint evals test what the product actually does in users' hands. these are different surfaces. one is a vanity metric. the other is the product.

0

1

0

1

35

29 days ago

i'm a context-switcher. it took me years to stop apologizing for it. the productivity advice industry is built for momentum people. deep blocks. monk mode. one thing at a time. it's good advice. it's just not for me. i do my best work jumping between threads. each one charges the next. the cost of forcing myself into a single-track day is bigger than the cost of switching. play to your wiring. half the productivity gospel is for someone else's brain.

0

0

0

1

55

about 1 month ago

@businessbarista AI-native

0

0

0

0

34

about 1 month ago

agents that over-explain what they did are showing their training. agents that just did the thing and moved on are showing taste. "here's the draft, does this look okay?" is a model output. "done, you're booked thursday at 2pm" is a product output. these are different things.

1

0

0

0

70

about 1 month ago

qa suite that runs once is a smoke test. qa suite that runs and then auto-updates the prompt against the failures is the actual product. the eval system that improves itself is worth 10x the eval system that just grades. we built the second one and i'd never go back.
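
A rough sketch of the loop described above: run the suite, collect failures, revise the prompt against them, and repeat. Everything here is an assumption for illustration. `improve_prompt`, `toy_agent`, and `toy_revise` are hypothetical names, and in a real system the agent and the reviser would presumably be model calls rather than these deterministic stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[str, str]  # (input, expected output)


def improve_prompt(
    prompt: str,
    cases: Sequence[Case],
    run_agent: Callable[[str, str], str],
    revise_prompt: Callable[[str, List[Case]], str],
    max_rounds: int = 5,
) -> Tuple[str, List[int]]:
    """Run the suite, revise the prompt on failures, and stop when clean.

    Returns the final prompt and the failure count observed in each round.
    """
    history: List[int] = []
    for _ in range(max_rounds):
        failures = [(x, want) for x, want in cases if run_agent(prompt, x) != want]
        history.append(len(failures))
        if not failures:
            break
        prompt = revise_prompt(prompt, failures)
    return prompt, history


# Deterministic stand-ins (hypothetical) so the loop can run without a model.
def toy_agent(prompt: str, text: str) -> str:
    return text.upper() if "uppercase" in prompt else text


def toy_revise(prompt: str, failures: List[Case]) -> str:
    return prompt + " Reply in uppercase."


final_prompt, failure_history = improve_prompt(
    "Echo the input.", [("hi", "HI"), ("ok", "OK")], toy_agent, toy_revise
)
print(final_prompt, failure_history)
```

A loop like this tunes the prompt to the cases it sees, so it is worth re-scoring the final prompt on traces the loop never touched before trusting the result.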

0

0

1

0

50

about 1 month ago

agent ux insight i didn't see coming: users tolerate 60+ seconds of latency once they trust the agent. not because they got more patient. because they stopped watching. trust = put the phone away. the product target isn't speed. it's earning enough trust that users are doing something else while you work.

1

2

0

1

51

about 1 month ago

@simonw @jarredsumner @bcherny we see this with noah all the time. the exec doesn't get replaced. they just stop doing the stuff they were putting off anyway. robobun same pattern. jarred has more time to think about bun instead of writing it. not a threat, just a reallocation.

0

0

0

0

470

about 1 month ago

@pmarca yes this is concerning

0

0

0

0

25
