0error @0error_ob - Twitter Profile

Pinned Tweet

5 months ago

Some phenomena often attributed to scaling limits may instead reflect unablated human-side priors upstream of the model. Below are three candidate variables that seem measurable and ablatable.

1

0

2K

0error

@0error_ob

about 1 month ago

https://t.co/sPh3M0HdDy

0

90

0error

@0error_ob

3 months ago

It’s interesting to frame theory generation as a modeling problem. But new theories don’t just emerge from scaling text; they’re constrained by what we can measure, test, and interface with. So the question becomes: are we scaling models, or scaling the measurement interfaces that make theories testable?

0

181

0error

@0error_ob

3 months ago

As models commoditize, execution gets cheaper and agent capabilities converge. The question shifts from "can the model do this" to "does the system learn from its own behavior." Eval is where that difference lives. Replies and DMs open — curious what others are seeing.

0

78


0error

@0error_ob

3 months ago

https://t.co/FPAVRJo9lc

1

0

135

0error

@0error_ob

3 months ago

A quick diagnostic: go through your team's task board Monday. Count eval-related items. Check how many different categories they're spread across. If it's 3+ with no shared owner — you may be dealing with an invisible layer problem.

1

0

89
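The task-board diagnostic in the post above can be sketched in a few lines. This is a minimal illustration, not the author's tooling: the task fields (`title`, `category`, `owner`), the keyword match on "eval", and the reading of "shared owner" as one person who owns an eval item in every category are all assumptions made for the example.

```python
from collections import defaultdict


def invisible_layer_check(tasks, keyword="eval", min_categories=3):
    """Flag a board whose eval work is spread across many categories
    with no single owner in common.

    `tasks` is a list of dicts with hypothetical keys
    'title', 'category', and 'owner'.
    """
    # Count eval-related items using a simple keyword match on the title.
    eval_tasks = [t for t in tasks if keyword in t["title"].lower()]

    # Group the owners of eval items by category.
    owners_by_category = defaultdict(set)
    for task in eval_tasks:
        owners_by_category[task["category"]].add(task["owner"])

    # Assumption: a "shared owner" owns at least one eval item in every category.
    if owners_by_category:
        shared = set.intersection(*owners_by_category.values())
    else:
        shared = set()

    categories = sorted(owners_by_category)
    return {
        "eval_items": len(eval_tasks),
        "categories": categories,
        "shared_owners": sorted(shared),
        "flagged": len(categories) >= min_categories and not shared,
    }


board = [
    {"title": "Fix eval flakiness", "category": "infra", "owner": "ana"},
    {"title": "New eval dataset", "category": "data", "owner": "bo"},
    {"title": "Eval dashboard", "category": "product", "owner": "cy"},
    {"title": "Ship login page", "category": "product", "owner": "cy"},
]
result = invisible_layer_check(board)
```

On this made-up board, three eval items sit in three categories with three different owners, so the check flags it. A real board would likely need a better signal than a title keyword, such as labels or tags.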

0error

@0error_ob

3 months ago

Thanks for reporting this. We’ve received your message, created an internal issue ticket, and the engineering team is currently investigating the problem. We’ll update you here as soon as we have more information. Really appreciate you trying OB-1 and flagging this.

1

0

38

0error

@0error_ob

3 months ago

https://t.co/u1jsMZQ5Nk

0

108

0error

@0error_ob

3 months ago

https://t.co/uJAEC9Jtiw

0

142

0error

@0error_ob

3 months ago

This is a good reminder that benchmarks are themselves measurement interfaces. When models get close to the ceiling, small assumptions in test design or dataset exposure can dominate the signal. So part of the question may not be model capability, but how reliably our evaluation pipelines capture it.

0

17

0error

@0error_ob

3 months ago

@BoWang87 LLMs learn correlations in how humans describe the world. But scientific data is already filtered through experimental interfaces. So part of the question might be: are we hitting model limits, or the limits of how we observe and encode causality?

0

11

0error

@0error_ob

3 months ago

Infrastructure is a huge part of it. But infrastructure itself doesn’t appear magically. Dense hardware ecosystems are the result of decades of industrial accumulation. In robotics, iteration speed is determined not just by engineers or capital, but by how tightly manufacturing, suppliers, and prototyping loops are integrated. What we’re seeing is not just robotics progress — it’s the compounding effect of an industrial base.

0

21

0error

@0error_ob

3 months ago

@pashov This is not a Claude failure. It’s a verification failure. AI has made code generation cheap, but security review didn’t scale with it. When code is money, “vibe coding” isn’t just a productivity hack — it becomes a systemic risk.

1

0

40

0error

@0error_ob

4 months ago

@ibab This resonates. What feels different to me isn’t raw capability so much as how stable the interface feels when you’re operating inside already-understood structures. When the error surface is a bit smoother, iteration starts to feel much more viable in domains like this.

0

272

0error

@0error_ob

4 months ago

https://t.co/M878i6PwfR

0

236

0error

@0error_ob

4 months ago

This resonates a lot. Many of the failure modes you describe seem less about raw capability, and more about unexamined assumptions at the interface — where agents act without surfacing uncertainty, tradeoffs, or responsibility boundaries. Curious whether making those interfaces more explicit changes the failure profile.

0

1

0

491

0error

@0error_ob

5 months ago

This framing is really helpful — especially the distinction between different modes of misgeneralization and how current judges overcount category-2 behavior. One thing I’m still unsure about: once we fix the evaluator, the remaining variance seems to move upstream — into what the evaluator is allowed to treat as misalignment in the first place. In that sense, the judge prompt (and the spec it encodes) feels like an observer interface: it doesn’t just measure EM, it shapes which generalization modes are even legible as EM. Curious if you see EM as something that can be fully “measured away,” or whether part of it is irreducibly tied to how we define and permit behavior during evaluation.

1

0

106

0error

@0error_ob

5 months ago

That makes sense — those constraints are exactly how people try to keep signal high. I guess what I’m unsure about is whether even with modern benchmarks and reasonable gain thresholds, some upstream shifts still fail to register cleanly once they’re forced through a fixed eval interface.

0

68
