Fred Sala @fredsala - Twitter Profile

fredsala retweeted

8 days ago

The dominant story in AI has been the growing cloud: bigger clusters, larger models, more gigawatts. We believe the future is in the opposite direction: on-device inference, smaller models, watts instead of gigawatts. Today we're releasing @OpenJarvisAI v1.0: a personal AI assistant that lives, learns, and works on your device.

49

596

91

566

144K

fredsala retweeted

Ndea @ndea

9 days ago

On the pod: "Recursive Program Synthesis" with @awsTO, Associate Professor at @WisconsinCS. How cold-emailing @SumitGulwani at Microsoft Research led to a novel research paper and inside Aws' vision to automatically synthesize the software stack for future quantum computers.

1

19

3

7

2K

fredsala retweeted

Sanmi Koyejo @sanmikoyejo

15 days ago

"AI for science" benchmarks today mostly test textbook recall. Terminal-Bench Science is a chance for scientists to practice writing that definition. Contribute a real workflow, and you find out exactly where today's best agents break on it. https://t.co/GZ28R5QIRn

0

26

6

13

3K

fredsala retweeted

Niloofar @niloofar_mire

14 days ago

🧬New agentic AI-for-science benchmark: SMDD-Bench! Can frontier LLM agents actually do small-molecule drug design? Real medicinal chemistry — not single-turn QA, not toy property prediction. Long-horizon, multi-turn, tool-using, with strict oracle budgets. We release 502 agentic tasks across 5 real drug-design workflows (pharmacophore ID, scaffold hopping, lead optimization, fragment assembly, interaction point discovery), every one guaranteed-solvable via a hidden witness molecule. Agents get a Python sandbox, 8 Boltz2 calls, 15 ADMET-AI calls, no internet — and have to plan across dozens of turns to spend that budget wisely. Result: GPT-5.4 and Gemini 3.1 Pro are neck-and-neck at the top (40.2% vs 39.0%), Claude Sonnet 4.6 right behind at 38%. Open-source models trail meaningfully. Even the best agents fail >60% of the time. 🧵 below


12

175

27

80

23K


fredsala retweeted

Pedram Hosseini @PedramHosseini

18 days ago

Rubric-based LLM evals are everywhere. But how do you know your rubric is any good? RIFT names 8 ways rubrics quietly break + automated diagnostics to catch them, great work by @SnorkelAI. Had a lot of fun building a repo to implement it and rerun the experiments. 🛠️ Repo: https://t.co/s5sTPlWU4T 📄 Paper: https://t.co/fqMAuAvL0Q

6

16

3

8

2K

Fred Sala @fredsala

21 days ago

@yuqirose Awesome news! Congrats Rose.

0

1

0

102

fredsala retweeted

Dimitris Papailiopoulos @DimitrisPapail

26 days ago

Ten years in academia and the best part has not been what many value most ie freedom to pursue your ideas. It’s experiencing your students grow and go on to incredible trajectories. What I’ve come to know about myself is that I value permanence, presence, and people. And for all the illusions that institutions, titles, awards etc offer, none at all come close to this: watching a human absorb, even in tiny amounts, the care and effort you’ve put into trying your best to just be there for them.

11

281

24

39

17K

fredsala retweeted

Chandan Singh @csinva

29 days ago

2 emerging interpretability trends I'm excited about from this paper: (1) agent-facing interp & (2) interp objectives for autoresearch 🧵

1

69

13

56

8K

Fred Sala @fredsala

30 days ago

@ziv_ravid If the ideas are strong enough, numbers are just a distraction.

1

15

0

4K

fredsala retweeted

Kelly Buchanan @ekellbuch

30 days ago

Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark! But the rankings survived, absolute scores moved up to 12pp!


28

762

74

219

85K

fredsala retweeted

terminalbench @terminalbench

30 days ago

We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0

TB2.1 includes

• recalibrated limits
• fixed solutions
• realigned verifiers

Per-task breakdowns in 🧵

We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜) https://t.co/NeNUny3v9t

2

52

12

10

15K

fredsala retweeted

Dan Biderman @dan_biderman

30 days ago

Legal AI is still far from solved. The breakthroughs needed will generalize to all knowledge work. A community is needed to get there, and Harvey is helping build it.

3

39

5

20

6K

fredsala retweeted

Gabe Pereyra @gabepereyra

about 1 month ago

https://t.co/AWIhrxBD5c

28

373

52

533

682K

fredsala retweeted

Gavin Brown @gavinrbrown1

about 1 month ago

If the NeurIPS paper checklist isn't a good motivation to do entirely theoretical work, I don't know what is.

3

241

5

46

26K

fredsala retweeted

Andy Konwinski @andykonwinski

about 1 month ago

first benchmark that I know of that tests an agent's ability to improve as it tackles multiple tasks

1

24

3

7

4K

Fred Sala @fredsala

about 1 month ago

Really excited for @pgasawa and team! Continual learning desperately needs benchmarks that distinguish raw ability from actually improving with experience. Continual Learning Bench is a great step in this direction.

Parth Asawa @pgasawa

about 1 month ago

Today, we’re releasing Continual Learning Bench 1.0: the first, realistic benchmark for measuring how AI systems can improve in online settings. Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened. But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and find there’s still plenty of headroom for learning. (1/n)


42

1K

156

900

829K

1

22

3

2

1K

Fred Sala @fredsala

about 1 month ago

@EarlenceF Congratulations Earlence! Great news.

0

1

0

73

Fred Sala @fredsala

about 1 month ago

@ArminPCM @SnorkelAI Thank you Armin :)

0

55

Fred Sala @fredsala

about 1 month ago

- CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation (https://t.co/cQeV38oPLv) - Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights (https://t.co/6dFtul9KoP)

0

10

1

0

744

Fred Sala @fredsala

about 1 month ago

Excited to share we have 4 papers accepted to ICML 2026, including one spotlight.

Proud of the students and collaborators, and looking forward to sharing more about these directions!

More on each coming soon---check out our work: https://t.co/CKKt0Oxz9O

3

83

20

9

7K

Fred Sala @fredsala

about 1 month ago

- Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models (https://t.co/qy21p8I2jS) - Weight Updates as Activation Shifts: A Principled Framework for Steering (https://t.co/mL6Z7GEW7M)

1

11

1

551
