Alex Ratner

A research group in @StanfordAILab working on the foundations of machine learning & systems. https://t.co/JHK58TDorG Ostensibly supervised by Chris Ré

about 5 hours ago

Our thanks to everyone who dropped by yesterday for boba to learn from @EchoShao8899 of @stanfordnlp about "Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration".

Key takeaway: across 3 tasks (travel planning, related-work writing, and tabular analysis), the best collaborative agents consistently outperformed fully autonomous ones when judged by real users.

Recording/transcript ICYMI: https://t.co/jbI3KX5HSI

ajratner retweeted

vincent sunn chen

@vincentsunnchen

about 7 hours ago

ProgramBench up-levels evaluation to the artifact rather than purely measuring implementation.

1. It mirrors the interface that software users (e.g. engineers, researchers) are increasingly interacting with
2. This provides a new kind of "research tool" to study frontier models, including implementation trade-offs, fuzzing/validation, new interfaces for steering models

Full discussion w/ @jyangballin below

ajratner retweeted

Eric Glyman

@eglyman

about 11 hours ago

Today, Ramp raised $750M at a $44B valuation.

Last time we grew this fast, we were 1/20th the size.

For 2000 years, business was built on two pillars. Today, a third: intelligence.

It’s your least governed cost. It’s also your single greatest opportunity. https://t.co/D7ZtqoJN6K

1 day ago

Check out new @SnorkelAI Benchtalks with @jyangballin, author of ProgramBench, SWE-bench, and many other key benchmarks in the space. While I'm crestfallen that @vincentsunnchen so quickly dropped the gag of having the interviews on a literal bench... this is a great one!!

vincent sunn chen

@vincentsunnchen

1 day ago

New Benchtalks with @jyangballin: on ProgramBench (0% frontier models at launch) and the lineage/future of coding benchmarks, from SWE-bench/InterCode to now

01:29 ProgramBench launch and reception
03:41 Why artifact-level evaluation, not code-level
06:03 Why models love Python
08:29 ProgramBench as a research tool
12:45 From SWE-bench & InterCode to ProgramBench
17:47 How to grade a coding model
21:53 The position paper & humans in the loop
25:01 Managing quality with agents-in-the-loop
28:40 Internet access and benchmark integrity
35:26 Where models may surpass human abilities
38:56 When a model hits 80% on ProgramBench
43:55 Benchmarks worth paying attention to
46:24 What benchmark do you wish existed
49:32 Will benchmarks still look like benchmarks in 5 years
52:02 How to contribute to ProgramBench

ajratner retweeted

2 days ago

Benchtalks #2: @jyangballin, creator of SWE-bench, fresh off ProgramBench (every frontier model: 0% at launch). Out soon with @vincentsunnchen. https://t.co/lCXdWhroHU

7 days ago

So excited for this bold vision of small, personalized, on-device AI that everyone can own themselves! Excited to support the multi-objective evaluations that chart this new quality/latency/efficiency/cost/privacy Pareto frontier via @SnorkelAI Open Benchmarks Grants.

Jon Saad-Falcon

@JonSaadFalcon

7 days ago

The dominant story in AI has been the growing cloud: bigger clusters, larger models, more gigawatts. We believe the future is in the opposite direction: on-device inference, smaller models, watts instead of gigawatts. Today we're releasing @OpenJarvisAI v1.0: a personal AI assistant that lives, learns, and works on your device.

ajratner retweeted

7 days ago

Huge congrats to @jonsaadfalcon, @Avanika15, @Azaliamirh and the @HazyResearch team on @OpenJarvisAI, out today.

For two years, they've been making the case that AI inference belongs on hardware people already own, not just in megawatt data centers. Excited to support the Intelligence per Watt line of work.

Read more on their blog: https://t.co/6XYJFICGj3

ajratner retweeted

8 days ago

Great turnout in San Jose today for @chris_m_glaze's paper session on Benchmarking Agents in Insurance Underwriting Environments at @CAISconf.

If you're at the conference, catch the team behind the paper today at the poster session from 5:15–6:45 p.m. at Carmel/Monterey. And come find us tomorrow night at the Day 2 Conference Reception (sponsored by Snorkel).

Paper: https://t.co/4VFVq4Gjcc

ajratner retweeted

Pierce Kelaita @PKelaita

8 days ago

Announcing JudgmentBench – a dataset we at @StanfordLaw liftlab developed along with @harvey and @SnorkelAI that evaluates frontier LLM work product.

The dataset contains 30 real-world tasks crafted by Biglaw attorneys paired with >3000 rubric and preference expert annotations. https://t.co/ZXKlSSQQtM

ajratner retweeted

Cognition @cognition

8 days ago

1/ We’ve raised over $1B at a $26B valuation, led by @Lux_Capital, @generalcatalyst, and @8vc.

Our enterprise usage has grown >10x since the start of this year, and our run-rate revenue grew to $492M.

We launched Devin two years ago as the first AI software engineer. Since then, cloud agents have gone from niche to mainstream, and today they are the fastest growing way to create software.

ajratner retweeted

Harvey @harvey

9 days ago

We evaluated frontier models on LAB, our long-horizon legal agent benchmark.

Three findings stood out:
1) Legal work is far from saturated by frontier models.
2) Model performance varies sharply by practice area.
3) Cost and latency rise at the frontier.

Read more: https://t.co/39aDou0jWc

ajratner retweeted

vincent sunn chen

@vincentsunnchen

9 days ago

Trajectory-based error analysis points to levers for post-training and harness engineering!

From the @harvey team:
- Verify-and-revise correlates with the biggest score jump (+1.5).
- "Fan-out" tool parallelism hurts (-0.5); potentially adds noise without direction
- Grounding drafts against source evidence is +0.3, but only occurs in 19% of trajectories

Excited for more behavior-level analysis over long-horizon agent evals - great example here from Legal Agent Benchmark (LAB)!

ajratner retweeted

9 days ago

Initial LAB results from Harvey put a number on something we see across specialized AI work: under rigorous all-pass standards, frontier models complete fewer than 10% of long-horizon legal tasks, and no single model leads across practice areas. General capability isn't sufficient for high-stakes professional work. Closing that gap takes domain-grounded data, evaluation, and post-training, which is exactly the research we're excited to do with the Harvey team next.

ajratner retweeted

Gabe Pereyra

@gabepereyra

9 days ago

https://t.co/sdxZJodpKB

ajratner retweeted

14 days ago

Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low-compute regimes and swung by our poster session.

Paper: https://t.co/dZL8uyhn4I

Around tonight? Unwind after the conference with drinks, swing suites, and the team behind the paper. Last chance to RSVP ⛳: https://t.co/cBsH6D9TEz

@vincentsunnchen @ArminPCM @realjustinbauer

15 days ago

Congratulations @ravirajjain @ravi_lsvp !!! They have been incredible partners to @SnorkelAI from day one, and at every stage after that. Well deserved recognition!!

Lightspeed @lightspeedvp

15 days ago

Congratulations to @ravi_lsvp, @ravirajjain, and @buckymoore on their recognition in the Seed 100 List!

The Seed 100 List from @businessinsider highlights early-stage investors with a unique ability to scout the tech stars of tomorrow. Amid the AI boom, the competitiveness and speed of investors getting in before the “seed stage” as we know it have been reinforced.

This is the Seed 100’s sixth year, and it is an honor to have 3 Lightspeed team members acknowledged on the list.

Early-stage investing has been wired into our team’s DNA for over 26 years. And we are incredibly proud to have backed many teams from their Seed rounds and beyond.

As Ravi puts it: “The founders Lightspeed backs don't extrapolate from the present; they derive from first principles and arrive at futures others haven't thought to look for.”
