Okay, I'm just going to come out and say it. We have to start sharing token use alongside model performance.
I don't think benchmarks are as useful if you see one model is 6% more accurate than another, but don't know if one uses 600% more tokens than the other.
A good model should balance accuracy with efficient token use.
This is why I share this in all of my benchmarks. Take this one from yesterday. If you just looked at the results you would say, oh GLM 5.2 High ties Fable 5.2 Low and Sonnet 5 High.
But the reality is that, to tie them both, it had to use 7,628% more tokens, and it cost 596% more.
Most benchmarks would just show all these models against each other with one accuracy score. This doesn't tell the whole story.
We can do better.
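To make the "% more" figures concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and this is not the benchmark's actual tooling; it only assumes that "X% more" is measured relative to the cheaper model.

```python
def percent_more(value: float, baseline: float) -> float:
    """Return how much larger `value` is than `baseline`, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value - baseline) / baseline * 100.0


def multiplier(percent: float) -> float:
    """Convert an 'X% more' figure into a plain multiple of the baseline."""
    return 1.0 + percent / 100.0


if __name__ == "__main__":
    # Figures quoted in the post: 7,628% more tokens, 596% more cost.
    print(f"tokens: {multiplier(7628):.2f}x the baseline")
    print(f"cost:   {multiplier(596):.2f}x the baseline")
```

Read this way, 7,628% more tokens is roughly 77x the tokens, and 596% more cost is roughly 7x the cost, for the same accuracy score.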
@HancockErikL @Robert78298457 @morganlinton Good question, definitely a lot. But then it wouldn't be an interesting benchmark because it wouldn't reflect the experience most users would have.
@stochastichimp While I don't think they are pointless without the token use/costs, they certainly tell only part of the story if they don't include this data.
Our latest benchmark is now live, but we're holding off on sharing it anywhere but this livestream for 24 hours.
So if you want to see the results of the comparison between Fable 5 Low, Fable 5 Medium, Opus 4.8 High, and GPT 5.5 High, you'll have to watch the stream.
Live long and benchmark
Just finished my livestream sharing the results of my latest benchmark comparing:
Fable 5 Low, Fable 5 High, Opus 4.8 High, GPT 5.5 High.
I decided I'm going to delay sharing the results anywhere but this livestream for 24 hours, so the only way to see the results early is to watch!
Also, had some awesome comments from people on the livestream and am including links to a couple of the projects that they shared.
If you join my livestream and participate, I'm always happy to share what you're building to help get the word out for you. I'm here to support builders.
And now, here's the livestream:
https://t.co/2r6AT6K0Z6
@lyknoada99 @morganlinton The purpose of this benchmark was to see how Fable Low performs. In this case, it is more performant than GLM 5.2 High and Sonnet 5 High.
A lot of people are very likely using Fable with the wrong effort level; you likely only need high effort an incredibly small % of the time.