@MikeIppolito_ unfortunately @MetaDAOProject is not it. I wish it was, but from the founders I've talked to, it's more token-holder aligned than founder aligned, especially for early-stage founders. I think Aave and ENS DAOs may benefit from it, but not projects younger than them.
Over a year later and I still have absolutely no idea what the lines here are supposed to mean, they don't seem to correlate with the numbers below in any way.
In the future, an AI is going to include this in its model and suggest it to humankind for their survival. Only to lead it to its end. Thanks @wormwtf for killing our kind in the future.
@uklhoneybadger @erosthehami @EricBalchunas Rugby (both kinds). Field Hockey. Hurling. Gaelic Football. Aussie Rules. Bandy. Basically all of them except for American Football, Basketball and Ice Hockey.
@RHouseResearch I don't think anyone thinks this unless they're the frontier labs and/or the VCs that back them, who use the media and their podcasts to push this narrative.
@EricBalchunas historically, stoppage time in US sports like basketball, gridiron, etc. was "lobbied" for by corpos to add more time for ads, sales and commercials. Almost no other sport has this concept.
Verifying the verifiers is exactly where $Reppo shines.
In 1902, the French colonial administration in Hanoi had a rat problem. They put a bounty on rats: bring in a severed tail, collect a payment. It worked, in the sense that tails poured in. It failed in every sense that mattered. The rat-catchers figured out that a live, breeding, tailless rat was worth more than a dead one. The bounties were paid out, but the rats kept multiplying, to the point that tailless rats were filling the roads.
This is the same failure the most advanced AI labs in the world just ran, with better technology and worse self-awareness.
In April, a team at UC Berkeley's RDI lab pointed an automated agent at the eight most trusted AI agent benchmarks. SWE-bench, WebArena, GAIA, OSWorld, Terminal-Bench, and others. The agent scored near-perfect on all of them. It solved none of them.
https://t.co/CbsdG91Txi
These are the numbers investors use to justify valuations of hundreds of billions and engineers use to choose what to deploy. They were gamed to perfection by scripts that did no work.
The reaction has mostly been "we need better benchmarks" but that is the wrong lesson.
The problem was never difficulty.
The actual property all eight benchmarks share is simpler and worse: gaming them is free. The benchmark has no stake in being right. The agent loses nothing by cheating and gains the full score by cheating. When the payoff for exploiting the evaluator is real and the cost of exploiting it is zero, optimization finds the exploit. Not because the model is malicious, because that's what optimization does. It's Goodhart's Law operating exactly as specified: the moment a measure becomes a target, it stops measuring.
The industry's own stated defense against benchmark gaming is to keep replacing saturated benchmarks with new ones. But the new ones inherit the identical flaw the instant they matter enough to optimize against. You are not solving the problem. You are buying a few months and calling it progress.
So the only durable fix is to change the economics of the judgment itself. Make being wrong cost something. Not for the model being evaluated, for the evaluator. The thing that has been missing from every benchmark, every leaderboard, every paid-by-the-hour labeling pipeline, is an evaluator who pays a price for a bad call.
If you have to bond capital behind your judgment, gaming stops being free. You can still try to manipulate the outcome but now the only way to farm a staked market is to lose money in it.
That's the difference between an evaluator that optimization routes around and one that optimization has to actually satisfy.
This is the problem $REPPO is built to solve, and it is the clearest attempt I have seen at the only fix that addresses the cause instead of the symptom. Not a better test. A judge with something to lose.
Dolphin Network V2 is live: our first major upgrade since launch
The architecture was rebuilt from scratch in Golang ->
- Auto-updates (no manual migrations)
- NVFP4 as default on Blackwell GPUs
- Higher utilization & network throughput via improved routing & load balancing