Siva

@ergodicthought

Soul of an artist in the mind of a scientist | “... specialization is for insects.”

Bengaluru, India

Joined October 2008

2.6K Following

496 Followers

3.7K Posts

Pinned Tweet

Siva @ergodicthought

over 1 year ago

1/ Let's unwrap why the notion of such an evaluation benchmark for AI models is irredeemably flawed, and how it promotes cargo cult mania...

Dan Hendrycks

@hendrycks

over 1 year ago

We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning. State-of-the-art AIs get <10% accuracy and are highly overconfident. @ai_risk @scaleai

202

751

15K

Siva @ergodicthought

3 minutes ago

@sanjeevsanyal @sreemoytalukdar @TheEconomist Seems like a submarine article to legitimise the CJP narrative

Siva @ergodicthought

43 minutes ago

Interesting categorical framework; it makes precise the distinctions between interpolation (retrieval), extrapolation (composition/search), and discovery. Also, I imagine that model consistency must enforce the sheaf (gluing) condition in any theory, and not just a presheaf

Markus J. Buehler

@ProfBuehlerMIT

2 days ago

We've made a breakthrough in self-evolving AI scientists moving from "search" to "principled discovery": Scientific discovery requires that the search space itself changes, and an AI scientist must perceive this shift without intervention. We built an AI that achieves this for the first time with the ability to discover the scientific vocabulary it reasons in. Evidence, tools, artifacts, verifiers, failures & claims become typed provenance. We show three distinct modalities: 1) retrieval, adding known objects; 2) search, exploring a fixed schema; and critically: 3) discovery, a verified regime transition. We solve the open-endedness evaluation problem by lifting agentic workflows into a typed copresheaf and proving, via a Kan obstruction, that true discovery is not unbounded generation but a verifiable schema expansion: old evidence is transported by Left Kan extension, and genuine novelty is mathematically quantified by the pointwise residual beyond the transported image - separating discovery from mere search and making novelty objective and measurable rather than a subjective judgment or benchmark delta. Our AI scientist is built in a way that does not pre-conceive the approach it chooses; instead, we endow the system with formal power to adapt, evolve, and reason from first principles. Case studies include: 1⃣Builder/Breaker model that discovers mode-conditioned compliance in proteins; 2⃣CategoryScienceClaw that finds anisotropic fiber-network stiffness rules. Great work in collaboration with my graduate student @fwang108_ @MITdeptofBE F.Y. Wang & M.J. Buehler, Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence, arXiv:2606.01444, 2026

364

743K

Siva @ergodicthought

about 1 hour ago

@Kaju_Nut @agrawalmanindra I never said that. In fact, my point is that Chb does apply, and I was leveraging that to facetiously question how much luck -vs- skill there was in JEE scores. That was before I realized that it can actually be measured instead of just joked about.

Siva @ergodicthought

about 1 hour ago

@Kaju_Nut @agrawalmanindra Assume scores from mains/adv as two samples of the RV. We can bound/estimate the SD (at each mean) and compare with the mean, to measure luck vs skill. Maybe this analysis is better done on the rank (because rank is very sensitive to score fluctuations), but this is a start.
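
The estimate described in the post above could be sketched roughly as follows. This is a minimal Python sketch under assumptions that are not in the thread: both exam scores are on a common scale, each score is skill plus independent noise, and the noise variance is pooled across candidates rather than estimated at each mean. The function names `luck_vs_skill` and `chebyshev_tail_bound` are hypothetical, the data is synthetic, and "Chb" is assumed to refer to Chebyshev's inequality.

```python
import random
import statistics


def luck_vs_skill(first, second):
    """Split score variance into a skill part and a luck part from paired scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 candidates")
    # With two samples per candidate, (a - b)^2 / 2 is an unbiased estimate of
    # that candidate's noise variance; average it over all candidates.
    luck_var = statistics.fmean((a - b) ** 2 / 2 for a, b in zip(first, second))
    # The variance of per-candidate means equals skill variance + luck_var / 2,
    # so subtract the noise contribution (clamped at zero).
    means = [(a + b) / 2 for a, b in zip(first, second)]
    skill_var = max(statistics.variance(means) - luck_var / 2, 0.0)
    total = skill_var + luck_var
    return {
        "luck_sd": luck_var ** 0.5,
        "skill_sd": skill_var ** 0.5,
        "skill_share": skill_var / total if total else 0.0,
    }


def chebyshev_tail_bound(k):
    """Chebyshev: P(|X - mean| >= k * SD) <= 1 / k^2, for any distribution."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / (k * k))


# Synthetic candidates only (not real JEE data): skill SD 40, luck SD 15.
rng = random.Random(0)
skill = [rng.gauss(150, 40) for _ in range(5000)]
first = [s + rng.gauss(0, 15) for s in skill]
second = [s + rng.gauss(0, 15) for s in skill]

result = luck_vs_skill(first, second)
print(result)
print(chebyshev_tail_bound(2))
```

Here `skill_share` is the fraction of single-exam score variance attributed to skill, and `chebyshev_tail_bound(k)` gives a distribution-free cap on how often one sitting can land `k` luck SDs away from a candidate's mean. Repeating the calculation on ranks instead of scores would need a separate treatment.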

Siva @ergodicthought

about 1 hour ago

@Kaju_Nut @agrawalmanindra Might turn out that the JEE is a better skill estimator than many other exams -- especially given the need for good resolution at the higher end (top 5-10k ranks). Could also study how this has varied across years. Likely useful inputs for designing JEE/admissions.

Siva @ergodicthought

about 2 hours ago

@Kaju_Nut @agrawalmanindra Huh? IIUC the ineq bounds the variation across samples of a rand var. Now that I think about it, one could flip the idea around and use the empirical distribution of score variations to bound how stochastic an estimator of skill the JEE is! Would be a very interesting exercise😃

ergodicthought retweeted

Elon Musk

@elonmusk

about 23 hours ago

Helpful tool for improvement. It’s just physics thinking in the limit.

32K

14K

Siva @ergodicthought

about 4 hours ago

@ArcanesValor @chamath WDYM the open models aren't on the efficient frontier of price/capability when they are priced at a pittance compared to the frontier models? Are you claiming that frontier-1 models from the big labs are better+cheaper than DeepSeek/etc?

Siva @ergodicthought

about 4 hours ago

Otherwise the claims are not even wrong, or to put it more scientifically: bullshit... so vague that it's impossible to reach agreement through empirical testing -- the kind that gets junked in scientific communication as unfalsifiable hypotheses.

Siva @ergodicthought

about 5 hours ago

There's a really easy way to settle it beyond all doubt. Folks just have to share session transcripts and the generated code for review. Post it online, show off your great work, and collect credit for it! Talk is cheap 🤷‍♂️

Perry E. Metzger

@perrymetzger

1 day ago

I have said this before, but to those of us using AI systems to get lots of work done reliably and quickly, the people who post online about how AIs still hallucinate constantly, about how they can’t write code, etc., seem equivalent to people trying to convince you that the car you drive to work every day doesn’t exist. You tell them things like “but I drive a car. I paid money for it. I buy gasoline for it. I could not possibly be working twenty miles away from home if I didn’t have the car?” and they reply that you are imagining having a car, or that you’re lying because you work for a car company. It is as though these people live in a completely different reality.

170

136

75K

Siva @ergodicthought

about 4 hours ago

@TanayLohia1 Look for someone who's done well in several unrelated things. And good at min one of math/phy/cs (critical thinking). Your challenge will be tempting/inspiring them to come work with you instead of all the other things they could be doing.

Siva @ergodicthought

about 5 hours ago

@perrymetzger @AdrienLE True, but how can they trust your judgement without you/whoever actually sharing session transcripts and the generated outputs for review? Maybe the output is great, or maybe the user is just getting carried away and inflating grades 🤷‍♂️ there's an easy way to settle it, you know

Siva @ergodicthought

about 6 hours ago

@jsensarma @karthiks Interesting to think of interpolations. We could eg. demand that repeat victories recursively grow in size as a fraction of total vote/seat share.

Siva @ergodicthought

about 6 hours ago

@torchcompiled @ayirpelle You can also add closed form deformations. Shouldn't be surprising that if you use the exact score you'll overfit / get the exact data. It's the regularisation that gives generalization in ANY model.

109

Siva @ergodicthought

1 day ago

@isabelunraveled What does it mean to be angry but do it well?

Siva @ergodicthought

1 day ago

@EMostaque @jaminball I'm sure a lot of folks would love to read about it if you write something longform! Might have interesting takeaways for people developing different kinds of breakthrough technology into "platforms", and trying to thread the needle on a workable business model.

Siva @ergodicthought

2 days ago

@TheStalwart They're both certainly HUNGRIER in their patterns of token consumption. Is that superior, or is something more token efficient supposed to be superior?

Siva @ergodicthought

2 days ago

@azeem 1. I hope this isn't just a measurement of tokens from OpenRouter/etc. 2. Is this direct API tokens, or does it include all product surfaces (e.g. Google Search, NotebookLM, etc.)?
