EvalEval Coalition

@evaluatingevals

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Joined June 2025

7 Following

481 Followers

90 Posts

Pinned Tweet

EvalEval Coalition @evaluatingevals

4 months ago

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀 A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧 A tale of broken AI evals 🧵👇 https://t.co/UOcxiMwKOL

2

47

16

19

13K

EvalEval Coalition @evaluatingevals

2 days ago

@icmlconf At EvalEval, we’re working toward more reliable, informative, and sustainable AI evaluation practices. Interested in helping shape the future of AI evaluation? Learn more about EvalEval: https://t.co/7eyTLMpo8I

0

8

2

1

796

EvalEval Coalition @evaluatingevals

2 days ago

⚠️ Key Takeaways:
- Nearly half of the benchmarks we studied show high or very high saturation, meaning they struggle to distinguish between models.
- Contrary to common assumptions, private test sets do not appear to prevent saturation once benchmarks become widely adopted.

2

6

0

1

183

EvalEval Coalition @evaluatingevals

2 days ago

Read the paper! See you at @icmlconf: https://t.co/qbQkbMQMzX

1

5

0

1

194

EvalEval Coalition @evaluatingevals

2 days ago

- Studied 60 widely used LLM benchmarks spanning reasoning, knowledge, coding, multilingual evaluation, and more.
- Annotated benchmarks across 14 properties (e.g., age, accessibility, language) to study how benchmark design relates to long-term resilience against saturation.

1

4

0

0

295

EvalEval Coalition @evaluatingevals

2 days ago

🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨 Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation” https://t.co/gMW1B8k1gg

2

37

11

16

8K

EvalEval Coalition @evaluatingevals

2 days ago

🎯 We aim to understand which benchmarks are saturating, why it happens, and what makes some benchmarks remain useful longer than others.
- We introduced a new uncertainty-aware saturation index that measures when top-performing models become statistically indistinguishable,

1

5

0

0

341
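To make "statistically indistinguishable" concrete, here is a minimal sketch. It is not the saturation index from the paper, whose definition is not given in this thread. It assumes a much simpler stand-in: two models count as indistinguishable when their 95% normal-approximation confidence intervals for accuracy overlap. The function names, model names, and scores are all hypothetical.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for an accuracy score."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


def top_two_indistinguishable(results, z=1.96):
    """Return True if the two best models' intervals overlap.

    `results` maps a model name to (correct, total) on one benchmark.
    This is an illustrative stand-in, not the paper's index.
    """
    ranked = sorted(results.values(), key=lambda ct: ct[0] / ct[1], reverse=True)
    best_low, _ = accuracy_interval(*ranked[0], z=z)
    _, second_high = accuracy_interval(*ranked[1], z=z)
    return second_high >= best_low


# Hypothetical scores: the top two models differ by half a point on 1,000 items.
saturated = {"model_a": (950, 1000), "model_b": (945, 1000), "model_c": (800, 1000)}
# Hypothetical scores: a clear gap between first and second place.
separated = {"model_a": (950, 1000), "model_c": (800, 1000)}

print(top_two_indistinguishable(saturated))
print(top_two_indistinguishable(separated))
```

With these made-up numbers the first benchmark can no longer separate its two leaders, while the second still can.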

EvalEval Coalition @evaluatingevals

about 1 month ago

Amazing effort by all the authors, working chairs and coalition members! If you're going to be at ICML, come say Hi! Stay tuned for updates on ICML socials 👋

0

4

1

0

212

EvalEval Coalition @evaluatingevals

about 1 month ago

🚀 EvalEval is 2/2 accepted at @icmlconf 2026 🚀
1⃣ Who Evaluates AI's Social Impact? Mapping Coverage and Gaps in First and Third Party Evaluations
2⃣ When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
Details below 🧵

4

25

8

8

2K

EvalEval Coalition @evaluatingevals

about 1 month ago

Working groups led by @evijit, @AnkaReuel, @akhtarmubashara and Jenny Chim! Preprints: Social Impacts -> https://t.co/immzuJGQpe Benchmark Saturation -> https://t.co/HSmu33Pgjq

1

4

1

0

203

EvalEval Coalition @evaluatingevals

about 1 month ago

New blog post on the cost of AI evals! Check it out 👇

about 1 month ago

AI evaluation is becoming its own compute bottleneck.

We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks.

In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over.

This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems.

The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap.

Some takeaways:

→ Leaderboards should report cost alongside accuracy.
→ Reliability should not be treated as optional.
→ We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements.

Read the full post: https://t.co/sArlZMkytF

Thanks for the insights @LChoshen, Yifan Mai, and @cgeorgiaw🤗

4

84

20

81

12K

0

10

1

3

981
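The cost figures in the quoted post above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the two numbers stated in the post (about $40K for 21,730 rollouts, and $2,829 for a single GAIA run before caching). The repeat count of 5 is an assumption chosen for illustration, and the function names are hypothetical.

```python
def cost_per_rollout(total_cost_usd, rollouts):
    """Average spend per agent rollout."""
    return total_cost_usd / rollouts


def repeated_run_cost(single_run_usd, repeats):
    """Cost of repeating one benchmark run, ignoring any caching savings."""
    return single_run_usd * repeats


# Figures from the post: about $40K across 21,730 rollouts.
hal_per_rollout = cost_per_rollout(40_000, 21_730)

# Figure from the post: $2,829 per GAIA run before caching.
# The repeat count of 5 is an assumed value, not taken from the post.
gaia_five_runs = repeated_run_cost(2_829, 5)

print(f"Average cost per rollout: ${hal_per_rollout:.2f}")
print(f"Five uncached GAIA runs: ${gaia_five_runs:,}")
```

At roughly $1.84 per rollout the per-item cost looks small. The total grows because reliability needs repeated runs across many models and benchmarks.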

EvalEval Coalition @evaluatingevals

3 months ago

3 days left! 📷 Writing, wrote, or just submitted a paper? Commit it to the EvalEval workshop at ACL 2026 in San Diego! https://t.co/JRSr50UA8y (including ARR Submissions, non-archival, positions, and extended abstracts!) Submission Deadline: March 19th, 2026 AoE

0

10

6

8

4K

EvalEval Coalition @evaluatingevals

3 months ago

📄 Submission Link: https://t.co/Rr3rC00Kpe 🔗 Workshop Website: https://t.co/JRSr50V7Y6 See you in San Diego! 🏖️ (2/2)

0

0

0

1

224

EvalEval Coalition @evaluatingevals

3 months ago

⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit! We welcome:
✅ Regular papers
✅ ARR submissions
✅ Non-archival work
✅ Position papers
✅ Extended abstracts
📅 Deadline: March 19 (1/2)

1

10

1

1

872

EvalEval Coalition @evaluatingevals

3 months ago

Sitting on results from papers or leaderboards? Whether you use lm-eval, Inspect AI, or HELM, we have low-lift converters ready to go. 🦾 💾 GitHub: https://t.co/CydLGwhVGb 📜 Co-authorship on the shared task paper for qualifying contributors 📅 Deadline: May 1, 2026

0

10

0

3

646

EvalEval Coalition @evaluatingevals

3 months ago

🧪 Your LLM evaluation results could help the whole field 🚀 🧑‍🔬 Our ACL Shared task is out! We’re building a unified, crowdsourced database to create a common language for AI evaluation reporting. And we need your data. (1/2) https://t.co/SQhEVsqEWg

1

60

8

37

18K

EvalEval Coalition @evaluatingevals

4 months ago

Read the full announcement: https://t.co/0aRHJ0t92x Shared Task: https://t.co/SQhEVsqEWg Project Webpage: https://t.co/UOcxiMwcZd #AIEvaluation #EvalEval

0

8

1

3

421

EvalEval Coalition @evaluatingevals

4 months ago

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀 A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧 A tale of broken AI evals 🧵👇 https://t.co/UOcxiMwKOL

2

47

16

19

13K

EvalEval Coalition @evaluatingevals

4 months ago

Thankful to our partners for the feedback: CAISI, @AiEleuther, @huggingface, @NomaSecurity, @TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

1

8

0

2

658
