We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.
State-of-the-art AIs get <10% accuracy and are highly overconfident.
@ai_risk @scaleai
Interesting categorical framework; it makes precise the distinctions between interpolation (retrieval), extrapolation (composition/search), and discovery.
Also, I imagine that model consistency must enforce the sheaf (gluing) condition in any theory, and not just the presheaf structure.
We've made a breakthrough in self-evolving AI scientists moving from "search" to "principled discovery": Scientific discovery requires that the search space itself changes, and an AI scientist must perceive this shift without intervention. We built an AI that achieves this for the first time with the ability to discover the scientific vocabulary it reasons in. Evidence, tools, artifacts, verifiers, failures & claims become typed provenance. We show three distinct modalities: 1) retrieval, adding known objects; 2) search, exploring a fixed schema; and critically: 3) discovery, a verified regime transition.
We solve the open-endedness evaluation problem by lifting agentic workflows into a typed copresheaf and proving, via a Kan obstruction, that true discovery is not unbounded generation but a verifiable schema expansion: old evidence is transported by Left Kan extension, and genuine novelty is mathematically quantified by the pointwise residual beyond the transported image - separating discovery from mere search and making novelty objective and measurable rather than a subjective judgment or benchmark delta.
Our AI scientist is built in a way that does not pre-conceive the approach it chooses; instead, we endow the system with formal power to adapt, evolve, and reason from first principles. Case studies include:
1️⃣ Builder/Breaker model that discovers mode-conditioned compliance in proteins;
2️⃣ CategoryScienceClaw that finds anisotropic fiber-network stiffness rules.
Great work in collaboration with my graduate student @fwang108_ @MITdeptofBE
F.Y. Wang & M.J. Buehler, Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence, arXiv:2606.01444, 2026
@Kaju_Nut @agrawalmanindra I never said that. In fact, my point is that Chb does apply, and I was leveraging that to facetiously question how much luck -vs- skill there was in JEE scores. That was before I realized that it can actually be measured instead of just joked about.
@Kaju_Nut @agrawalmanindra Assume scores from mains/adv are two samples of the RV. We can bound/estimate the SD (at each mean) and compare it with the mean, to measure luck vs skill.
Maybe this analysis is better done on the rank (because rank is very sensitive to score fluctuations), but this is a start.
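A rough sketch of that estimate in Python, under assumptions that are mine and not from the thread: both exams are put on a common scale first (e.g. z-scores), each score is skill plus independent noise of equal variance, and the names `luck_vs_skill`, `noise_sd`, `skill_sd` and `skill_share` are made up for illustration.

```python
import math
from statistics import pvariance


def luck_vs_skill(first, second):
    """Split score variance into noise ("luck") and skill from paired scores.

    first, second: one score per candidate from each exam, same order,
    already on a common scale (assumption: e.g. z-scores per exam).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 candidates")

    # Difference of two noisy readings of the same skill: Var(a - b) = 2 * noise_var.
    # pvariance centers the differences, so a constant shift between exams drops out.
    diffs = [a - b for a, b in zip(first, second)]
    noise_var = pvariance(diffs) / 2

    # Per-candidate average: Var((a + b) / 2) = skill_var + noise_var / 2.
    avgs = [(a + b) / 2 for a, b in zip(first, second)]
    skill_var = max(pvariance(avgs) - noise_var / 2, 0.0)

    total = skill_var + noise_var
    return {
        "noise_sd": math.sqrt(noise_var),
        "skill_sd": math.sqrt(skill_var),
        # Fraction of single-exam score variance attributable to skill.
        "skill_share": skill_var / total if total > 0 else float("nan"),
    }


if __name__ == "__main__":
    # Toy numbers, purely illustrative (not real JEE data).
    print(luck_vs_skill([10, 20, 30, 40], [12, 18, 32, 38]))
```

A `skill_share` near 1 would mean the two sittings mostly agree; running the same thing on ranks instead of scores only needs the inputs swapped.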
@Kaju_Nut @agrawalmanindra Might turn out that the JEE is a better skill estimator than many other exams -- especially given the need for good resolution at the higher end (top 5-10k ranks). Could also study how this has varied across years.
Likely useful inputs for designing JEE/admissions.
@Kaju_Nut @agrawalmanindra Huh? IIUC the ineq bounds the variation across samples of a rand var.
Now that I think about it, one could flip the idea around and use the empirical distribution of score variations to bound how stochastic an estimator of skill the JEE is! Would be a very interesting exercise😃
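For reference, a sketch of the "flipped" bound, assuming the inequality in question is Chebyshev's and that a candidate's two scores $X_1, X_2$ are independent with the same mean $\mu$ and variance $\sigma^2$ (both assumptions are mine; $D$, $t$ and $\hat{p}_t$ are illustrative symbols):

```latex
% Chebyshev for a single score:
P\big(|X - \mu| \ge k\sigma\big) \le \frac{1}{k^2}

% For the difference D = X_1 - X_2: E[D] = 0 and Var(D) = 2\sigma^2, so
P\big(|D| \ge t\big) \le \frac{2\sigma^2}{t^2}

% Flipped: if a fraction \hat{p}_t of candidates shows |D| \ge t, then roughly
\sigma^2 \gtrsim \frac{\hat{p}_t \, t^2}{2}
```

The last line treats the observed fraction as the probability, so it is only a rough lower bound on how noisy a single sitting is.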
@ArcanesValor @chamath WDYM the open models aren't on the efficient frontier of price/capability when they are priced at a pittance compared to the frontier models? Are you claiming that frontier-1 models from the big labs are better+cheaper than DeepSeek/etc?
Otherwise the claims are not even wrong, or to put it more scientifically: bullshit... so vague that it's impossible to reach agreement through empirical testing -- the kind that gets junked in scientific communication as unfalsifiable hypotheses.
There's a really easy way to settle it beyond all doubt. Folks just have to share session transcripts and the generated code for review. Post it online, show off your great work, and collect credit for it!
Talk is cheap 🤷‍♂️
I have said this before, but to those of us using AI systems to get lots of work done reliably and quickly, the people who post online about how AIs still hallucinate constantly, about how they can’t write code, etc., seem equivalent to people trying to convince you that the car you drive to work every day doesn’t exist.
You tell them things like “but I drive a car. I paid money for it. I buy gasoline for it. I could not possibly be working twenty miles away from home if I didn’t have the car?” and they reply that you are imagining having a car, or that you’re lying because you work for a car company.
It is as though these people live in a completely different reality.
@TanayLohia1 Look for someone who's done well in several unrelated things. And good at at least one of math/phy/cs (critical thinking). Your challenge will be tempting/inspiring them to come work with you instead of all the other things they could be doing.
@perrymetzger @AdrienLE True, but how can they trust your judgement without you/whoever actually sharing session transcripts and the generated outputs for review? Maybe the output is great, or maybe the user is just getting carried away and inflating grades 🤷‍♂️ there's an easy way to settle it, you know
@jsensarma @karthiks Interesting to think of interpolations. We could e.g. demand that repeat victories recursively grow in size as a fraction of total vote/seat share.
@torchcompiled @ayirpelle You can also add closed form deformations.
Shouldn't be surprising that if you use the exact score you'll overfit / get the exact data. It's the regularisation that gives generalization in ANY model.
@EMostaque @jaminball I'm sure a lot of folks would love to read about it if you write something longform! Might have interesting takeaways for people developing different kinds of breakthrough technology into "platforms", and trying to thread the needle on a workable business model.
@TheStalwart They're both certainly HUNGRIER in their patterns of token consumption. Is that superior, or is something more token efficient supposed to be superior?
@azeem 1. I hope this isn't just a measurement of tokens from OpenRouter/etc.
2. Is this direct API tokens, or does it include all product surfaces (e.g. Google Search, NotebookLM, etc.)?