Here is the latest project I'm working on right now:
Orrery - An autonomous AI coding-loop engine
With a live orbital visualizer to watch it run
Fully open source under the MIT License
Link is in the comments
Okay, I'm just going to come out and say it. We have to start sharing token use alongside model performance.
A benchmark is far less useful if it shows that one model is 6% more accurate than another but doesn't tell you whether that model uses 600% more tokens to get there.
A good model should balance accuracy with efficient token use.
This is why I share token use in all of my benchmarks. Take this one from yesterday. If you just looked at the results, you would say, oh, GLM 5.2 High ties Fable 5.2 Low and Sonnet 5 High.
But the reality is that, to tie them both, it had to use 7,628% more tokens and cost 596% more.
Most benchmarks would just show all these models against each other with one accuracy score. This doesn't tell the whole story.
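For anyone who wants to report this alongside accuracy, here is a minimal sketch of the arithmetic. The model names and numbers are made up for illustration and are not real benchmark results. The `Result` fields and the choice of the lowest-token run as the baseline are assumptions too. Under the usual reading of "X% more", a model that uses 7,628% more tokens uses roughly 77 times as many.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str        # hypothetical model label
    accuracy: float   # fraction of tasks solved, 0..1
    tokens: int       # total tokens used across the benchmark
    cost_usd: float   # total cost of the run


def percent_more(value: float, baseline: float) -> float:
    """How much larger value is than baseline, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value - baseline) / baseline * 100.0


def compare(results: list[Result]) -> list[dict]:
    """Report each run relative to the run that used the fewest tokens."""
    base = min(results, key=lambda r: r.tokens)
    return [
        {
            "model": r.model,
            "accuracy": r.accuracy,
            "tokens_pct_more": percent_more(r.tokens, base.tokens),
            "cost_pct_more": percent_more(r.cost_usd, base.cost_usd),
        }
        for r in results
    ]


# Illustrative numbers only: two runs that tie on accuracy.
results = [
    Result("model-a", 0.80, 1_000, 1.00),
    Result("model-b", 0.80, 5_000, 3.00),
]

for row in compare(results):
    print(
        f"{row['model']}: accuracy {row['accuracy']:.0%}, "
        f"{row['tokens_pct_more']:.0f}% more tokens, "
        f"{row['cost_pct_more']:.0f}% more cost"
    )
```

The two runs look identical on accuracy alone. The extra two columns are what show how much one of them paid for the tie.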
We can do better.