Daniel Shepard @danielwshepard - Twitter Profile

Pinned Tweet

about 2 months ago

For the last few months I have been working on a new benchmark. Introducing AutomationBench. It tries to measure the cutting edge of models' capabilities in real-world business workflows across multiple apps and noisy data. The best models haven't beaten 10% yet.

2

4

0

644

Daniel Shepard @danielwshepard

2 days ago

@AndrewWarner @briandecoded @jspujji Good question!😄 I did experiment with this on myself to validate many of the benchmark tasks. It is very time-consuming. I never did make it through all ~600.😅 The public dataset is available for anyone who wants to give it a shot, though! https://t.co/2kgLyzJHU8

1

0

16

danielwshepard retweeted

Mike Knoop

@mikeknoop

2 days ago

Zapier AutomationBench being used to report Tool Use performance on Fable 5's model card

0

8

1

0

1K

Daniel Shepard @danielwshepard

2 days ago

@wadefoster @zapier From Fable's System and Model Cards:

0

1

0

43

Daniel Shepard @danielwshepard

2 days ago

From Fable's system and model cards:

0

33

Daniel Shepard @danielwshepard

2 days ago

Fable 5 seems better than Opus in every way. Like Opus is to Sonnet. It works smarter rather than harder. Cost is 2x Opus but cost per task was only 17% more on max reasoning! Fable is much more efficient with tokens than other models. ~1/2 the cost of GPT 5.5 xhigh.

Wade Foster

@wadefoster

2 days ago

Claude Fable is here: the first model in their new Mythos series.

It's the new top score on @Zapier's AutomationBench at 17.4%, just two weeks after Opus 4.8 set the record at 15.5%.

Our AutomationBench measures what enterprises actually care about: can a model do the work? Find the right CRM record, send the right follow-up, update the right system without breaking anything?

We tested 600 tasks across 6 domains. Here’s what we saw:

Fable knows when to work smarter instead of harder. That means fewer timeouts and fewer wasted tokens in production.

EXAMPLE: One task asked the model to reconcile employee benefits across countries. The HR system's benefit-plans endpoint returned a 404. Fable hit it once, immediately pivoted to the team's spreadsheet and inbox, found the plan data there, and finished the task. Meanwhile, Opus moved on and missed a key detail.

That's the Fable pattern. It follows complex instructions precisely (especially the "leave these ones alone" kind), and when it hits a dead end, it goes looking somewhere else instead of spinning its wheels and wasting tokens.

PRICING: You may have seen that Fable is 2x the price of Opus. But that's the model rate, not the task cost. In Zapier, Fable came in at $3.67 per task at max effort, only 17% more than Opus 4.8 max at $3.14.

tl;dr:

Who should immediately upgrade their workflows from @claudeai's Opus to Fable?

- Operations & HR
- Long Horizon Tasks needing reliability and autonomy
- Any workflows where precision + accuracy matter more than cost

2

16

3

4K

1

20

3

2K
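The pricing claim in Wade Foster's post above can be checked with a little arithmetic. The sketch below uses only the two per-task figures quoted there ($3.67 and $3.14). The `percent_more` helper and the implied token ratio are illustrative assumptions, not figures Zapier published; the token ratio also assumes the per-token rate is exactly 2x and that both models use the same mix of input and output tokens.

```python
# Per-task costs quoted in the post (USD, max effort).
fable_cost = 3.67
opus_cost = 3.14

def percent_more(new: float, baseline: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

increase = percent_more(fable_cost, opus_cost)
print(f"Fable costs {increase:.1f}% more per task than Opus 4.8")

# Assumption: Fable's per-token rate is exactly 2x Opus's.
# If so, the task-cost ratio divided by the rate ratio approximates
# how many tokens Fable spends relative to Opus on the same tasks.
rate_ratio = 2.0
implied_token_ratio = (fable_cost / opus_cost) / rate_ratio
print(f"Implied token usage vs Opus: about {implied_token_ratio:.2f}x")
```

Under those assumptions the increase rounds to the quoted 17%, and the implied token usage is a bit under 0.6x, which is one way to read "more efficient with tokens".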

Daniel Shepard @danielwshepard

2 days ago

Our CEO Wade talks AutomationBench with examples.

Andrew Warner

@AndrewWarner

2 days ago

🚨 Anthropic released Claude Fable 5

It's Mythos, but safe.

The BIG question: Is it dependable enough to use apps to grow your business?

@wadefoster's team at @zapier ran it through 600+ real-world business uses.

Key results:

1. It stays on track - if you ask it about a specific topic in a specific Slack channel, it won't merge data in from other channels and topics.
2. It's the most resourceful - They told it to get HR data from an API that was down. It quickly switched from using the failed API to searching email & spreadsheets. (GPT 5.5 hit the down API 22 times!)
3. It routes intelligently - They asked it to take leads from multiple sources and send each to the right salesperson. It kills at operational tasks like that.

BUT:

1. For sales and marketing tasks, GPT 5.5 is still more dependable.
2. Fable is crazy expensive ($3.67/task vs $0.87 for Gemini 3.5 Flash)

If you love numbers (like me) the AutomationBenchmark leaderboard is below.

8

19

4

16

8K

0

1

0

38

Daniel Shepard @danielwshepard

2 days ago

Fable 5 is out. AutomationBench made the model card! (under Tool use)

Claude

@claudeai

2 days ago

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision.

The longer and more complex the task, the larger Fable 5’s lead over our other models. https://t.co/DxgSu0KUxh

499

15K

2K

5M

0

1

0

174

Daniel Shepard @danielwshepard

6 days ago

@wadefoster But each model that does support reasoning effort was run at the highest available effort.

0

15

Daniel Shepard @danielwshepard

6 days ago

@wadefoster Today I learned that some of these models, like Kimi K2.6, do not actually accept a reasoning effort setting. The provider we used for running these handles that behind the scenes. So the difference between Max and High was just run-to-run variance rather than an actual difference in reasoning effort.

0

29

Daniel Shepard @danielwshepard

9 days ago

This is a great talk on benchmarks! A good overview, some popular benchmarks, all the variables that can change results, and things to watch out for.

Florian Brand

@xeophon

10 days ago

The talk is now on YouTube! Link: https://t.co/dY25kEuIUn

1

88

8

54

16K

0

2

0

31

Daniel Shepard @danielwshepard

13 days ago

@AndrewWarner @zapier Thanks for having me on!

More info on AutomationBench: https://t.co/BZrEEOQciw
GitHub repo where anyone can run it themselves: https://t.co/j7ZbLF45cT
White paper: https://t.co/RVbSBYTYrg

1

5

0

105

Daniel Shepard @danielwshepard

13 days ago

Talked with Andrew Warner about what AutomationBench measures and about Opus 4.8!

Andrew Warner

@AndrewWarner

13 days ago

Opus 4.8 is doing what 4.7 refused to do.

4.7 refused tasks related to:
• diversity hiring
• finance
• paychecks

Said "too risky."

@zapier tests every model by asking it to do a set of tasks and sees how many they get right.

I asked the guy who runs their benchmark work to teach me what each model can do and where they fail.

4.8 does the most multi-task work well, but it's not the winner for every task.

4

17

2

9

7K

1

9

2

2K

Daniel Shepard @danielwshepard

14 days ago

@wadefoster @AnthropicAI

0

31

Daniel Shepard @danielwshepard

14 days ago

@zapier Gemini Flash was cheaper per task, but Opus is very token-efficient. Gemini 3.5 Flash rarely called tools in parallel, whereas Opus did more often than not. This meant Opus needed half the number of input tokens and steps. https://t.co/iyiOE9Zh9V

1

0

67

Daniel Shepard @danielwshepard

14 days ago

@wadefoster @AnthropicAI Gemini Flash was cheaper per task, but Opus is very token-efficient. Gemini 3.5 Flash rarely called tools in parallel, whereas Opus did more often than not. This meant Opus needed half the number of input tokens and steps. https://t.co/uBwtqlLSYs

0

1

0

79
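One way to see why parallel tool calls can roughly halve both steps and input tokens is a toy cost model. The sketch below is an assumption-laden illustration, not AutomationBench's harness or any provider's billing logic: the `total_input_tokens` function, the prompt size, and the per-result token count are all made up for the example. It assumes every step re-sends the prompt plus all tool results gathered so far.

```python
import math

def total_input_tokens(num_calls: int, calls_per_step: int,
                       prompt_tokens: int = 2000,
                       tokens_per_result: int = 500) -> tuple[int, int]:
    """Return (steps, total input tokens) for a toy agent run.

    Each step re-sends the prompt plus every tool result gathered so far,
    so total input tokens grow with the number of steps taken.
    """
    steps = math.ceil(num_calls / calls_per_step)
    total = 0
    results_so_far = 0
    for _ in range(steps):
        total += prompt_tokens + results_so_far * tokens_per_result
        results_so_far = min(num_calls, results_so_far + calls_per_step)
    return steps, total

# Hypothetical task needing 12 tool calls.
sequential = total_input_tokens(num_calls=12, calls_per_step=1)
parallel = total_input_tokens(num_calls=12, calls_per_step=2)
print("one call per step: ", sequential)
print("two calls per step:", parallel)
```

In this toy setup, batching two calls per step cuts the step count from 12 to 6 and brings input tokens to a little under half, because the growing context is re-read fewer times.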

Daniel Shepard @danielwshepard

14 days ago

@zapier Here is the lineup with cost per task:

0

1

0

44

danielwshepard retweeted

AMC @TweetAnnaMarie

14 days ago

AutomationBench tests how models perform on the trickiest, stickiest real-world workflows we know customers are actually trying to automate. 600 tasks, 6 domains, deterministic scoring.

And today our scores are featured on @AnthropicAI's official launch scorecard. https://t.co/KwK3y75WUk

1

5

2

0

225

danielwshepard retweeted

Lisan al Gaib

@scaling01

14 days ago

Opus 4.8 ranks #1 on AutomationBench

AutomationBench measures whether an agent can complete a realistic end-to-end business workflow https://t.co/pWWHTXPJLK

4

150

9

12

15K

danielwshepard retweeted

Zapier

@zapier

14 days ago

Opus 4.8, the first model to break 15% on AutomationBench, is now live in Zapier!

It handles complex HR, Finance, and multi-app workflows better than anything else we've tested: refusals dropped from 20% to 4%

Opus 4.7 would see a sensitive task and stop, but 4.8 keeps going https://t.co/pDoCpr6ZNE

6

27

5

2

6K

Daniel Shepard @danielwshepard

21 days ago

@tobihanl @wadefoster @GeminiApp Yeah, AutomationBench allows a maximum of 50 steps. Until recent models, this limit was almost never hit, but High reasoning hit it on 38% of tasks, so its score was unfortunately capped by that. We will raise the limit in a future version. High would likely score higher otherwise.

1

0

30
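The effect of a step cap like the one described above can be sketched in a few lines. Only the 50-step limit comes from the post; `run_task`, `limit_hit_rate`, and the simulated agents are hypothetical names for illustration and do not reflect how AutomationBench is actually implemented.

```python
from typing import Callable

MAX_STEPS = 50  # the cap mentioned in the post; everything else is hypothetical

def run_task(agent_step: Callable[[int], bool], max_steps: int = MAX_STEPS) -> dict:
    """Call agent_step(step_number) until it returns True or the step budget runs out."""
    for step in range(1, max_steps + 1):
        if agent_step(step):
            return {"finished": True, "steps": step}
    # Budget exhausted: the run is cut off before the task is finished.
    return {"finished": False, "steps": max_steps}

def limit_hit_rate(results: list[dict]) -> float:
    """Fraction of runs that were cut off by the step cap."""
    return sum(not r["finished"] for r in results) / len(results)

# Simulated agents that would need 10, 30, 60, and 80 steps to finish.
steps_needed = [10, 30, 60, 80]
results = [run_task(lambda step, n=n: step >= n) for n in steps_needed]
print(results)
print(f"cut off by the cap: {limit_hit_rate(results):.0%}")
```

Runs that would have finished with more steps are recorded as unfinished, which is why a model that hits the cap often can score lower than it otherwise would.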
