Yanan Long @YananLong - Twitter Profile

2 days ago

🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨

Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation” https://t.co/gMW1B8k1gg

YananLong retweeted

EvalEval Coalition @evaluatingevals

about 1 month ago

🚀 EvalEval is 2/2 accepted at @icmlconf 2026 🚀

1⃣ Who Evaluates AI's Social Impact? Mapping Coverage and Gaps in First and Third Party Evaluations
2⃣ When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Details below 🧵

YananLong retweeted

Avijit Ghosh

@evijit

about 1 month ago

AI evaluation is becoming its own compute bottleneck.

We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks.

In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over.

This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems.

The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap.

Some takeaways:

→ Leaderboards should report cost alongside accuracy.
→ Reliability should not be treated as optional.
→ We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements.

Read the full post: https://t.co/sArlZMkytF

Thanks for the insights @LChoshen, Yifan Mai, and @cgeorgiaw🤗

YananLong retweeted

Avijit Ghosh

@evijit

3 months ago

So happy to see Every Eval Ever (@evaluatingevals) take off! This is a big vote of confidence, and we really hope that we, as a community of eval practitioners, can move towards open standards that unlock scientific rigor and reproducibility. Thanks @mercor_ai !

YananLong retweeted

QueerInAI @QueerinAI

10 months ago

ALL ABOARD!! 📝📝 FINAL CALL for NeurIPS CfP 🚨🚨 Deadline on 14th August EOD AoE✨ If you are a queer researcher, or conduct research on topics of queerness, please submit your incredible research for the Queer in AI workshop!! 🏳️‍🌈🏳️‍⚧️

YananLong retweeted

QueerInAI @QueerinAI

10 months ago

Note that the workshop is non-archival and we accept past research, or work you plan on submitting elsewhere. Also, we accept non-traditional submissions! For more info, check out https://t.co/RcaOpDOeHK

YananLong retweeted

Andrew Gelman et al. @StatModeling

11 months ago

loo R package 10 years! https://t.co/kLDliQLlml

Yanan Long @YananLong

11 months ago

@KLdivergence It's probably a bit too hot...

Yanan Long @YananLong

11 months ago

This is happening today!! Sign up for the CRAFT session and/or social https://t.co/ItgFgMl8Uc ⬇️

Yanan Long @YananLong

12 months ago

What is the future of @ACMFacct? Join our #FAccT2025 CRAFT session, "Taking Stock at FAccT: Insights from Participatory Design," to discuss its role in activism, community building, & real-world impact. This is your chance to help co-create our collective vision. 🧵 (1/4) https://t.co/dQ4i4HskyP

Yanan Long @YananLong

12 months ago

Come to our session on June 26 (Day 4) at 15:15 in the Amphitheatre (just before the closing townhall) Sign up here: https://t.co/ivYcjQApW5

Yanan Long @YananLong

12 months ago

Finally met my wonderful collaborators in person (!!) Sign up for session: https://t.co/ivYcjQApW5

Yanan Long @YananLong

12 months ago

This CRAFT session is the result of work with a wonderful team of collaborators: @_jansimson, Shiran Dudy

Yanan Long @YananLong

12 months ago

@soldni @allen_ai Do you play games at work?

Yanan Long @YananLong

12 months ago

This CRAFT session is the result of work with a wonderful team of collaborators: @_jansimson, Shiran Dudy

Yanan Long @YananLong

12 months ago

🗓️ Day 4 (June 26)
🕒 15:15
📍 Amphitheatre

Please bring a laptop or other devices you can type and submit statements with! Then, continue the conversation over meze & drinks at 18:30 (location TBD). See you there! #FAccT2025 (4/4) https://t.co/nbGWsgiNW3

Yanan Long @YananLong

12 months ago

What is the future of @ACMFacct? Join our #FAccT2025 CRAFT session, "Taking Stock at FAccT: Insights from Participatory Design," to discuss its role in activism, community building, & real-world impact. This is your chance to help co-create our collective vision. 🧵 (1/4)

Yanan Long @YananLong

12 months ago

The conversation continues! The @UsePolis conversation will stay open after our session, allowing the discussion to evolve. Your input will directly contribute to a report for the FAccT community and beyond. (3/4)

YananLong retweeted

Tao Long @CHI2026 @taolongg

about 1 year ago

🇯🇵🌸 #CHI2025 is abt to start! so its time! all queers! all of u on the spectrum! 🏳️‍🌈🏳️‍⚧️ @queerinAI is hosting a breakfast social on day 3 🥯🍘 📅【apr30, 8am】 ☕️【merengue hawaii cafe】 signup: https://t.co/QeUG7gqW6d *limited seats, come fast or cry later lol :) see u then!

Yanan Long @YananLong

about 1 year ago

@PranavVenkit Yes I can help - can you DM me?
