VulcanBench

Verified account

@VulcanBench

Open Source LLM benchmarking tool, focused on real world tests, large codebases, full transparency. An Open Source project by @morganlinton.

Lake Tahoe

Joined March 2020

14 Following

484 Followers

500 Posts

Pinned Tweet

about 9 hours ago

Okay, I'm just going to come out and say it. We have to start sharing token use alongside model performance. I don't think benchmarks are as useful if you see that one model is 6% more accurate than another, but don't know whether it uses 600% more tokens. A good model should balance accuracy with efficient token use. This is why I share this in all of my benchmarks. Take this one from yesterday. If you just looked at the results you would say, oh, GLM 5.2 High ties Fable 5.2 Low and Sonnet 5 High. But the reality is, to tie them both, it had to use 7,628% more tokens and cost 596% more. Most benchmarks would just show all these models against each other with one accuracy score. This doesn't tell the whole story. We can do better.
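To make the arithmetic in that post concrete, here is a minimal sketch of how "X% more" figures relate to plain multipliers. The function names are hypothetical and are not part of VulcanBench. The only numbers used are the two percentages quoted in the post, since no raw token counts or prices are given.

```python
def percent_more(value: float, baseline: float) -> float:
    """Return how much larger `value` is than `baseline`, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value / baseline - 1.0) * 100.0


def multiple_from_percent_more(pct: float) -> float:
    """Convert an 'X% more' figure into a multiplier of the baseline."""
    return 1.0 + pct / 100.0


# Percentages quoted in the post (tokens and cost relative to the tied models).
token_multiple = multiple_from_percent_more(7628)
cost_multiple = multiple_from_percent_more(596)

print(f"tokens: {token_multiple:.2f}x the baseline")
print(f"cost:   {cost_multiple:.2f}x the baseline")
```

Read this way, "7,628% more tokens" means roughly 77 times the tokens for the same accuracy score, which is the gap a single accuracy number hides.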

5

51

9

3

4K

about 1 hour ago

@daniel_mac8 So excited to benchmark it 🖖

0

1

0

0

12

about 1 hour ago

@theo Yesss!

0

1

0

0

6

about 1 hour ago

@mweinbach I need them all to benchmark 🖖

0

1

0

0

10

about 1 hour ago

@HancockErikL @Robert78298457 @morganlinton Good question, definitely a lot. But then it wouldn't be an interesting benchmark because it wouldn't reflect the experience most users would have.

0

1

0

0

7

about 1 hour ago

@theo Honor to see you in my replies Theo! 🙏

0

0

0

0

14

about 1 hour ago

@butaji Yes Vitaly!

0

1

0

0

19

about 7 hours ago

@mercor_ai Are you able to share token use across these?

2

3

0

0

1K

about 7 hours ago

@morganlinton Live long and benchmark 🖖

0

1

0

0

34

about 7 hours ago

@stochastichimp While I don't think they are pointless without the token use/costs, they certainly tell only part of the story if they don't include this data.

1

1

0

0

38

about 7 hours ago

@stochastichimp Very true.

0

0

0

0

38

about 7 hours ago

@mbohnert Exactly.

0

1

0

0

43

about 7 hours ago

@JRoy777 @morganlinton Agreed, most people rarely need more than Fable Medium. They'll get the same results with High and Extra, just pay more.

0

2

0

0

14

about 7 hours ago

@TheCoderBtw Yes, so important to show token use in benchmarks.

1

2

0

0

14

about 9 hours ago

Our latest benchmark is now live, but we're holding off on sharing it anywhere but this livestream for 24 hours. So if you want to see the results of the comparison between Fable 5 Low, Fable 5 Medium, Opus 4.8 High, and GPT 5.5 High, you'll have to watch the stream. Live long and benchmark 🖖

about 9 hours ago

Just finished my livestream sharing the results of my latest benchmark comparing Fable 5 Low, Fable 5 High, Opus 4.8 High, and GPT 5.5 High. I decided I'm going to delay sharing the results anywhere but this livestream for 24 hours, so the only way to see the results early is to watch! Also, I had some awesome comments from people on the livestream and am including links to a couple of the projects that they shared. If you join my livestream and participate, I'm always happy to share what you're building to help get the word out for you. I'm here to support builders. And now, here's the livestream 🔽 https://t.co/2r6AT6K0Z6

8

35

3

16

3K

0

4

1

1

332

about 9 hours ago

@jnikolaidis @morganlinton @grok I am calling Fable through the API so no promos here, the usage promo is only for using it within a sub.

0

1

0

0

16

1 day ago

@lyknoada99 @morganlinton The purpose of this benchmark was to see how Fable Low performs. In this case, it is more performant than GLM 5.2 High and Sonnet 5 High. A lot of people are very likely using Fable with the wrong effort level; you likely only need high effort an incredibly small % of the time.

0

3

0

0

66

1 day ago

@_thomasip @morganlinton I like this idea.

0

1

0

0

7

1 day ago

@cristianuibar @morganlinton Great combo.

0

1

0

0

7

1 day ago

@thejorgg @morganlinton Definitely has an overthinking problem, almost needs a harness specifically optimized for this issue.

1

2

0

0

27
