Robin Salimans @SalimansRobin - Twitter Profile

Robin Salimans

@SalimansRobin

4 days ago

Zapier’s AutomationBench making waves! Try out Fable 5 🔥

Claude

@claudeai

4 days ago

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision. The longer and more complex the task, the larger Fable 5’s lead over our other models.

claudeai's tweet photo. Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision.

The longer and more complex the task, the larger Fable 5’s lead over our other models. https://t.co/DxgSu0KUxh

509

16K

2K

5M

0

26

Robin Salimans

@SalimansRobin

15 days ago

Great explanation of AutomationBench by @danielwshepard! 👏 Turned out to be a super useful benchmark!

Andrew Warner

@AndrewWarner

15 days ago

Opus 4.8 is doing what 4.7 refused to do. 4.7 refused tasks related to: • diversity hiring • finance • paychecks Said "too risky." @zapier tests every model by asking it to do a set of tasks and sees how many they get right. I asked the guy who runs their benchmark work to teach me what each model can do and where they fail. 4.8 does the most multi-task work well, but it's not the winner for every task.

4

17

2

9

7K

1

2

0

743

SalimansRobin retweeted

AMC @TweetAnnaMarie

16 days ago

AutomationBench tests how models perform on the trickiest, stickiest real-world workflows we know customers are actually trying to automate. 600 tasks, 6 domains, deterministic scoring. And today our scores are featured on @AnthropicAI's official launch scorecard.

TweetAnnaMarie's tweet photo. AutomationBench tests how models perform on the trickiest, stickiest real-world workflows we know customers are actually trying to automate. 600 tasks, 6 domains, deterministic scoring.

And today our scores are featured on @AnthropicAI's official launch scorecard. https://t.co/KwK3y75WUk

1

5

2

0

225

SalimansRobin retweeted

Lisan al Gaib

@scaling01

16 days ago

Opus 4.8 ranks #1 on AutomationBench AutomationBench measures whether an agent can complete a realistic end-to-end business workflow

scaling01's tweet photo. Opus 4.8 ranks #1 on AutomationBench

AutomationBench measures whether an agent can complete a realistic end-to-end business workflow https://t.co/pWWHTXPJLK

4

149

9

11

15K

Who to follow

C̵͚͙͝ã̴̼̳r̶͕̐ś̶̯̩ọ̵̊͘n̶̞̜̾2̶̫̄̍ͅ6̶̨̈́͘7̴̼̮̈̕

@carson_267

Ashwin Jayaprakash

@fanbyprinciple

Am a grateful cat which means that I get to say 'gato arigato' 😹 Interests: AI, Infosec & Tech that happens to be in season. https://t.co/p0wkBUBkFR

Fernando_Ulhoa

@Horizonfernando

🧠📚

SalimansRobin retweeted

Logan Kilpatrick

@OfficialLoganK

23 days ago

Gemini 3.5 Flash ranks #1 on Automation Bench (from Zapier), beating every other frontier model at a much lower cost

180

1K

61

123

135K

Robin Salimans

@SalimansRobin

about 1 month ago

FedEx be like: “we’ll deliver this sometime between the day you’re born and the day you die (maybe, we’ll see) please be home thank you”

0

35

Robin Salimans

@SalimansRobin

about 1 month ago

@Jordy_vD_ *changes prettier config* -> pnpm fmt -> done! :)

1

0

35

Robin Salimans

@SalimansRobin

about 1 month ago

@wadefoster @zapier Hands down my favorite product I've ever worked on! We've been cooking 🔥

0

3

0

51

Robin Salimans

@SalimansRobin

about 1 month ago

@ckor We've been cooking 🔥🔥

0

20

Robin Salimans

@SalimansRobin

about 1 month ago

Can't wait to show the world what we've been working on!

Wade Foster

@wadefoster

about 1 month ago

Today we open early access to @Zapier's next product We've been using it internally for months (>1,000,000 actions executed so far) and it's completely changed how we work. We're letting a small group of AI-forward teams try it first and give feedback. Who we’re looking for: - 10-100 person teams using AI daily, but need to do so more collaboratively - teams ready to build and ship automations through agents, not just chat - teams that already work in the open (call recordings, shared docs, public channels, etc) - teams obsessed with optimizing every aspect of their work Join the waitlist: https://t.co/TjsaSI6QdN

wadefoster's tweet photo. Today we open early access to @Zapier's next product

We've been using it internally for months (>1,000,000 actions executed so far) and it's completely changed how we work.

We're letting a small group of AI-forward teams try it first and give feedback.

Who we’re looking for:
- 10-100 person teams using AI daily, but need to do so more collaboratively
- teams ready to build and ship automations through agents, not just chat
- teams that already work in the open (call recordings, shared docs, public channels, etc)
- teams obsessed with optimizing every aspect of their work

Join the waitlist: https://t.co/TjsaSI6QdN

97

625

46

484

144K

0

50

SalimansRobin retweeted

Zapier

@zapier

about 2 months ago

GPT-5.5 just hit 12.9% on our AutomationBench leaderboard First model to break 10% When context is missing, most models stop. GPT-5.5 keeps checking emails, docs, and chats until it knows what to do

zapier's tweet photo. GPT-5.5 just hit 12.9% on our AutomationBench leaderboard

First model to break 10%

When context is missing, most models stop. GPT-5.5 keeps checking emails, docs, and chats until it knows what to do https://t.co/AKddNlDycb

1

18

4

5

3K

Robin Salimans

@SalimansRobin

about 2 months ago

It’s been super fun working on this with @danielwshepard! It’s truly a great benchmark. Also thanks @mikeknoop for helping us figure things out along the way and @PrimeIntellect for the great verifiers framework + Lab 🤝

Wade Foster

@wadefoster

about 2 months ago

We built an AI benchmark that measures real work. Today we're releasing it to everyone. AI evals tell you whether a model can do complex reasoning or generate code. Useful, but usually not the question our customers ask. They want to know: can this model find the right CRM record, send the right follow-up, and not break anything along the way? We went looking for a benchmark that tested that. Nobody had built one, so we did. @Zapier’s AutomationBench drops AI models into realistic business environments across six domains (Sales, Marketing, Ops, Support, Finance, HR) and checks whether the work actually got done. The tasks include live CRM data, inbox threads with ambiguous context, and multi-step tool chains where one wrong call cascades. Scoring is deterministic: either the right records were updated and the right messages were sent, or they weren't. It’s useful enough that we're releasing it publicly today. Open task set, open methodology, open leaderboard. Everyone should have access to this. No model has cracked 10%. Yet. Try it here: https://t.co/V7qHAGX7Ql

wadefoster's tweet photo. We built an AI benchmark that measures real work.

Today we're releasing it to everyone.

AI evals tell you whether a model can do complex reasoning or generate code. Useful, but usually not the question our customers ask. They want to know: can this model find the right CRM record, send the right follow-up, and not break anything along the way?

We went looking for a benchmark that tested that. Nobody had built one, so we did.

@Zapier’s AutomationBench drops AI models into realistic business environments across six domains (Sales, Marketing, Ops, Support, Finance, HR) and checks whether the work actually got done.

The tasks include live CRM data, inbox threads with ambiguous context, and multi-step tool chains where one wrong call cascades.

Scoring is deterministic: either the right records were updated and the right messages were sent, or they weren't.

It’s useful enough that we're releasing it publicly today. Open task set, open methodology, open leaderboard. Everyone should have access to this.

No model has cracked 10%. Yet.

Try it here: https://t.co/V7qHAGX7Ql

15

132

21

138

36K

0

2

1

0

141

SalimansRobin retweeted

Prime Intellect @PrimeIntellect

about 2 months ago

We're excited to support the release of @Zapier's AutomationBench on the Environments Hub, measuring frontier performance on real Zapier workflows. Across 6 domains, 47 tools, and 600 tasks, frontier models all score under 10%.

PrimeIntellect's tweet photo. We're excited to support the release of @Zapier's AutomationBench on the Environments Hub, measuring frontier performance on real Zapier workflows.

Across 6 domains, 47 tools, and 600 tasks, frontier models all score under 10%. https://t.co/wZEOlxsBd3

5

102

18

26

35K

SalimansRobin retweeted

Wade Foster

@wadefoster

about 2 months ago

We built an AI benchmark that measures real work. Today we're releasing it to everyone. AI evals tell you whether a model can do complex reasoning or generate code. Useful, but usually not the question our customers ask. They want to know: can this model find the right CRM record, send the right follow-up, and not break anything along the way? We went looking for a benchmark that tested that. Nobody had built one, so we did. @Zapier’s AutomationBench drops AI models into realistic business environments across six domains (Sales, Marketing, Ops, Support, Finance, HR) and checks whether the work actually got done. The tasks include live CRM data, inbox threads with ambiguous context, and multi-step tool chains where one wrong call cascades. Scoring is deterministic: either the right records were updated and the right messages were sent, or they weren't. It’s useful enough that we're releasing it publicly today. Open task set, open methodology, open leaderboard. Everyone should have access to this. No model has cracked 10%. Yet. Try it here: https://t.co/V7qHAGX7Ql

15

132

21

138

36K

Robin Salimans

@SalimansRobin

2 months ago

This might be Zapier’s most powerful launch to date; go tell your agent to use the Zapier SDK!

Wade Foster

@wadefoster

2 months ago

Today we open the Zapier SDK to everyone. If you're building with AI agents, this is for you. I've been using this for 2 months. It's totally changed how I do my job. You install it in your coding agent. Cursor, Claude Code, Codex, whatever you use. Now that agent has access to 8,000+ apps through @Zapier and can do anything those APIs can do. I think it’s the most powerful thing we’ve launched in years. Now in open beta. Just give this link right to your agent: https://t.co/k6arEyZMMU

101

1K

92

2K

174K

0

1

0

57

Robin Salimans

@SalimansRobin

5 months ago

this looks *way* too fun

near

@nearcyan

5 months ago

this is how i claude code now. it's fun!

396

10K

566

7K

1M

0

81

Robin Salimans

@SalimansRobin

5 months ago

@ty_todd1 @DSPyOSS great stuff!

1

0

185

SalimansRobin retweeted

Andrej Karpathy

@karpathy

over 1 year ago

Agency > Intelligence I had this intuitively wrong for decades, I think due to a pervasive cultural veneration of intelligence, various entertainment/media, obsession with IQ etc. Agency is significantly more powerful and significantly more scarce. Are you hiring for agency? Are we educating for agency? Are you acting as if you had 10X agency? Grok explanation is ~close: “Agency, as a personality trait, refers to an individual's capacity to take initiative, make decisions, and exert control over their actions and environment. It’s about being proactive rather than reactive—someone with high agency doesn’t just let life happen to them; they shape it. Think of it as a blend of self-efficacy, determination, and a sense of ownership over one’s path. People with strong agency tend to set goals and pursue them with confidence, even in the face of obstacles. They’re the type to say, “I’ll figure it out,” and then actually do it. On the flip side, someone low in agency might feel more like a passenger in their own life, waiting for external forces—like luck, other people, or circumstances—to dictate what happens next. It’s not quite the same as assertiveness or ambition, though it can overlap. Agency is quieter, more internal—it’s the belief that you *can* act, paired with the will to follow through. Psychologists often tie it to concepts like locus of control: high-agency folks lean toward an internal locus, feeling they steer their fate, while low-agency folks might lean external, seeing life as something that happens *to* them.”

2K

50K

9K

36K

11M

Robin Salimans

@SalimansRobin

6 months ago

Join us in a few mins! https://t.co/QQeQW0v3z1

Zapier

@zapier

7 months ago

If you build AI agents, don't miss this session. Next Tuesday, we're going inside the RL environments that make them work. Join Will Brown (@willccbb) from @PrimeIntellect and our own Robin Salimans as we show how we design spaces where agents improve safely & efficiently. Here's what we'll cover 👇 1. What are RL environments? A look at how we use controlled, instrumented settings to let agents safely learn from feedback. 2. Building RL environments at Zapier How we’re applying the Verifiers library and GEPA (Genetic-Pareto) optimization to evaluate and refine agent behavior on real automation tasks. 3. Lessons learned so far Early insights into what’s worked, what hasn’t, and how we’re evolving our approach to agent training and evaluation. Join us live, Tuesday December 2nd at 1 PM ET. Reserve a seat here: https://t.co/j2epeJic5v

zapier's tweet photo. If you build AI agents, don't miss this session.

Next Tuesday, we're going inside the RL environments that make them work.

Join Will Brown (@willccbb) from @PrimeIntellect and our own Robin Salimans as we show how we design spaces where agents improve safely & efficiently.

Here's what we'll cover 👇

1. What are RL environments?
A look at how we use controlled, instrumented settings to let agents safely learn from feedback.

2. Building RL environments at Zapier
How we’re applying the Verifiers library and GEPA (Genetic-Pareto) optimization to evaluate and refine agent behavior on real automation tasks.

3. Lessons learned so far
Early insights into what’s worked, what hasn’t, and how we’re evolving our approach to agent training and evaluation.

Join us live, Tuesday December 2nd at 1 PM ET.

Reserve a seat here: https://t.co/j2epeJic5v

11

12

3

4

3K

1

4

1

0

149

Robin Salimans

@SalimansRobin

7 months ago

@redtachyon @PrimeIntellect You had me there for a second

0

1

0

436

Robin Salimans

@SalimansRobin

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users