Some phenomena often attributed to scaling limits may instead reflect unablated human-side priors upstream of the model.
Below are three candidate variables that seem measurable and ablatable.
It’s interesting to frame theory generation as a modeling problem. But new theories don’t just emerge from scaling text—they’re constrained by what we can measure, test, and interface with.
So the question becomes: are we scaling models, or scaling the measurement interfaces that make theories testable
As models commoditize, execution gets cheaper and agent capabilities converge. The question shifts from "can the model do this" to "does the system learn from its own behavior." Eval is where that difference lives.
Replies and DMs open — curious what others are seeing.
A quick diagnostic: go through your team's task board Monday. Count eval-related items. Check how many different categories they're spread across. If it's 3+ with no shared owner — you may be dealing with an invisible layer problem.
Thanks for reporting this — we’ve received your message.
We’ve created an internal issue ticket and the team is currently investigating the problem. We’ll update you here as soon as we have more information.
Really appreciate you trying OB-1 and flagging this.
您好,
我们已经收到您的反馈,并已在内部创建 issue ticket,工程团队正在排查中。
一旦有进展我们会第一时间在这里回复您。感谢您使用 OB-1 并帮助我们发现问题。
This is a good reminder that benchmarks are themselves measurement interfaces.
When models get close to the ceiling, small assumptions in test design
or dataset exposure can dominate the signal.
So part of the question may not be model capability,
but how reliably our evaluation pipelines capture it.
@BoWang87 LLMs learn correlations in how humans describe the world.
But scientific data is already filtered through experimental interfaces.
So part of the question might be:
are we hitting model limits,
or the limits of how we observe and encode causality?
Infrastructure is a huge part of it.
But infrastructure itself doesn’t appear magically.
Dense hardware ecosystems are the result of decades of industrial accumulation.
In robotics, iteration speed is determined not just by engineers or capital, but by how tightly manufacturing, suppliers, and prototyping loops are integrated.
What we’re seeing is not just robotics progress — it’s the compounding effect of an industrial base.
@pashov This is not a Claude failure.
It’s a verification failure.
AI has made code generation cheap, but security review didn’t scale with it.
When code is money, “vibe coding” isn’t just a productivity hack — it becomes a systemic risk.
@ibab This resonates. What feels different to me isn’t raw capability so much as how stable the interface feels when you’re operating inside already-understood structures. When the error surface is a bit smoother, iteration starts to feel much more viable in domains like this.
This resonates a lot.
Many of the failure modes you describe seem less about raw capability, and more about unexamined assumptions at the interface — where agents act without surfacing uncertainty, tradeoffs, or responsibility boundaries.
Curious whether making those interfaces more explicit changes the failure profile.
This framing is really helpful — especially the distinction between different modes of misgeneralization and how current judges overcount category-2 behavior.
One thing I’m still unsure about: once we fix the evaluator, the remaining variance seems to move upstream — into what the evaluator is allowed to treat as misalignment in the first place.
In that sense, the judge prompt (and the spec it encodes) feels like an observer interface: it doesn’t just measure EM, it shapes which generalization modes are even legible as EM. Curious if you see EM as something that can be fully “measured away,” or whether part of it is irreducibly tied to how we define and permit behavior during evaluation.
That makes sense — those constraints are exactly how people try to keep signal high. I guess what I’m unsure about is whether even with modern benchmarks and reasonable gain thresholds, some upstream shifts still fail to register cleanly once they’re forced through a fixed eval interface.