@antigravity Please fix this issue: if you ask Gemini 3.5 Flash to rate something from 1 to 10, it always gives it a 7, even if the answer is a perfect 10!
It hallucinates "compression guidelines".
Thanks for the updates!
One small nit-pick: the Theme selector is a bit too visible now. It's the element in the sidebar that pops out the most, and it takes a lot of attention.
Maybe remove the background around the icons?
Or maybe even replace the icon on the left with the currently selected theme, and only offer the other two options on the right (see 2nd image).
Adding a few more coding tests to @AIBenchy. The main use case for LLMs atm is coding, so it makes sense for models that are better at coding to be ranked higher overall than those that are better at trivia.
Still, general intelligence, puzzle solving, and not being easily tricked are what I think make a good AI and bring us closer to AGI.
@eastdakota My websites are being scraped like crazy: one website gets 5000 bot fetches per day,
from various bots/countries.
And it's not just fetching HTML: some bots actually navigate, or even register.
Also, still debating how to track Input Tokens IF a request is rejected by the provider (e.g. trying to use tool_calling for models that don't support it).
Should the Input Tokens still be counted, even if the request failed, just to be consistent?
I made it so "Total Input Tokens" are displayed, to make it easier to understand the Total Cost of running benchmarks.
The Input Tokens should be mostly the same for all models, BUT when models do Tool Calling, the tool result + the previous prompt are passed in again on the next request, so the total input tokens can vary depending on how they call the tools.
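
A rough sketch of why the totals diverge, in Python. This is not the actual benchmark code: the function name, the parameters, and the token numbers are all made up for illustration. It assumes every request re-sends the full context so far (original prompt + each earlier tool call + its result), which is how chat-style tool calling typically works.

```python
def total_input_tokens(prompt_tokens, tool_result_tokens, tool_call_tokens=0):
    """Sum the input tokens over one tool-calling conversation.

    prompt_tokens:      size of the original prompt (hypothetical number)
    tool_result_tokens: list with the size of each tool result, in call order
    tool_call_tokens:   assumed size of the model's own tool-call message,
                        which is also re-sent as part of the context
    """
    context = prompt_tokens
    total = context  # first request: just the prompt
    for result in tool_result_tokens:
        # Each follow-up request re-sends everything so far,
        # plus the tool call and its result.
        context += tool_call_tokens + result
        total += context
    return total


# Same 100-token prompt, different tool usage:
no_tools = total_input_tokens(100, [])          # one request
one_call = total_input_tokens(100, [50])        # 100 + 150
two_calls = total_input_tokens(100, [50, 50])   # 100 + 150 + 200
```

So a model that splits the same work into two tool calls ends up with more total input tokens than one that does it in a single call, even though the prompt is identical.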