🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨
Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation”
🚀 EvalEval went 2 for 2: both of our papers were accepted at @icmlconf 2026 🚀
1️⃣ Who Evaluates AI's Social Impact? Mapping Coverage and Gaps in First and Third Party Evaluations
2️⃣ When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
Details below 🧵
AI evaluation is becoming its own compute bottleneck.
We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks.
In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over.
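To make the arithmetic concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. The function names and the repeat counts are illustrative assumptions, "about $40K" is treated as exactly $40,000, and repeated runs are assumed to get no savings from caching, so read the output as an order-of-magnitude estimate rather than a reported result.

```python
# Figures quoted in the post. "About $40K" is approximate, so every
# derived number below is an estimate, not a reported result.
HAL_TOTAL_USD = 40_000       # Holistic Agent Leaderboard spend (approx.)
HAL_ROLLOUTS = 21_730        # agent rollouts across 9 models and 9 benchmarks
GAIA_SINGLE_RUN_USD = 2_829  # one GAIA run on a frontier model, before caching


def cost_per_rollout(total_usd: float, rollouts: int) -> float:
    """Average spend per rollout, in USD."""
    return total_usd / rollouts


def repeated_run_cost(single_run_usd: float, repeats: int) -> float:
    """Cost of repeating one run `repeats` times.

    Assumption: no caching savings between repeats, so cost scales linearly.
    """
    return single_run_usd * repeats


if __name__ == "__main__":
    avg = cost_per_rollout(HAL_TOTAL_USD, HAL_ROLLOUTS)
    print(f"Average cost per rollout: ${avg:.2f}")

    # Hypothetical repeat counts, chosen only for illustration.
    for repeats in (1, 3, 5, 10):
        total = repeated_run_cost(GAIA_SINGLE_RUN_USD, repeats)
        print(f"{repeats:>2} GAIA run(s): ${total:,.0f}")
```

Even at a few dollars per rollout on average, the linear scaling in the second function is what makes repeated runs hard to afford for anyone without a large compute budget.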
This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems.
The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap.
Some takeaways:
→ Leaderboards should report cost alongside accuracy.
→ Reliability should not be treated as optional.
→ We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements.
Read the full post: https://t.co/sArlZMkytF
Thanks for the insights, @LChoshen, Yifan Mai, and @cgeorgiaw 🤗
So happy to see Every Eval Ever (@evaluatingevals) take off! This is a big vote of confidence, and we really hope that we, as a community of eval practitioners, can move towards open standards that unlock scientific rigor and reproducibility.
Thanks, @mercor_ai!
ALL ABOARD!! 📝📝 FINAL CALL for NeurIPS CfP 🚨🚨 Deadline on 14th August EOD AoE✨
If you are a queer researcher, or conduct research on topics of queerness, please submit your incredible research for the Queer in AI workshop!! 🏳️‍🌈🏳️‍⚧️
Note that the workshop is non-archival and we accept past research, or work you plan on submitting elsewhere. Also, we accept non-traditional submissions!
For more info, check out https://t.co/RcaOpDOeHK
What is the future of @ACMFacct? Join our #FAccT2025 CRAFT session, "Taking Stock at FAccT: Insights from Participatory Design," to discuss its role in activism, community building, & real-world impact. This is your chance to help co-create our collective vision. 🧵 (1/4)
🗓️ Day 4 (June 26)
🕒 15:15
📍 Amphitheatre
Please bring a laptop or another device you can use to type and submit statements!
Then, continue the conversation over meze & drinks at 18:30 (location TBD). See you there! #FAccT2025 (4/4)
https://t.co/nbGWsgiNW3
The conversation continues! The @UsePolis conversation will stay open after our session, allowing the discussion to evolve. Your input will directly contribute to a report for the FAccT community and beyond. (3/4)
🇯🇵🌸 #CHI2025 is abt to start!
so it's time! all queers! all of u on the spectrum! 🏳️‍🌈🏳️‍⚧️ @queerinAI is hosting a breakfast social on day 3 🥯🍘
📅【apr30, 8am】
☕️【merengue hawaii cafe】
signup: https://t.co/QeUG7gqW6d
*limited seats, come fast or cry later lol :) see u then!