Benchmarks aren’t even real anymore.
The labs making these LLMs are overfitting the model weights to the benchmark dataset.
The ONLY way I personally believe in testing a model is seeing what it can do on real-world tasks, like building a UI or debugging a specific problem.
WHAT THE HELL is happening in AI?
A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5.
3 BILLION.
The weights are on Hugging Face, anyone can test it.
I genuinely don't know if this is a breakthrough or if the benchmarks are broken.
@kimmonismus not to mention they also limit their models in the chat interface, whereas with 5.5 I'm talking to it without having to be conservative with my prompts.
@kimmonismus why do you think Claude is surpassing OpenAI in growth? I tried Claude Code for fable when it came out, but even when I was using an older LLM like Opus 4.7, I'd hit my limit way faster than with Codex.
@theo yeah haha, I feel like so many people on the AI train hype anything that’s packaged as “groundbreaking”. Like Fusion-style model panels for example: cool benchmark, but the cost/latency doesn't quite make sense. That's the part a lot of non-technical people miss.
Interesting idea, but I’m trying to understand the economics here.
If Fusion is running multiple models in parallel + a judge/synthesizer, shouldn’t token cost scale pretty aggressively with panel size? @OpenRouter
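Rough back-of-the-envelope sketch of what I mean. This is NOT how Fusion actually works or bills; it just assumes every panel model gets the full prompt and the judge reads the prompt plus every panel response. The function name and all the numbers are made up for illustration.

```python
def fusion_tokens(panel_size, prompt_tokens, response_tokens, judge_output_tokens):
    """Estimate total (input, output) tokens for one panel + judge request.

    Assumed setup (hypothetical, not confirmed by OpenRouter):
    - each of the `panel_size` models reads the full prompt
    - each panel model writes `response_tokens` tokens
    - the judge reads the prompt plus all panel responses
    - the judge writes `judge_output_tokens` tokens
    """
    panel_input = panel_size * prompt_tokens
    panel_output = panel_size * response_tokens
    judge_input = prompt_tokens + panel_output
    total_input = panel_input + judge_input
    total_output = panel_output + judge_output_tokens
    return total_input, total_output


# A single model answering the same prompt: 1000 tokens in, 500 out.
single = (1000, 500)
# A 3-model panel plus a judge under the assumptions above.
panel = fusion_tokens(panel_size=3, prompt_tokens=1000,
                      response_tokens=500, judge_output_tokens=500)
print(single, panel)  # (1000, 500) (5500, 2000)
```

Under these assumptions each extra panel model adds one more prompt read, one more response, and one more response for the judge to read, so cost grows linearly with panel size on top of a fixed judge overhead.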
AMD Ryzen AI Halo. The ultimate local AI developer platform.
Pre-order now: https://t.co/Ny0ZV8LOYi
⚡ Up to 128GB unified memory
⚡ Support for models up to 200B parameters
⚡ Windows & Linux support
⚡ Ready-to-run AI workflows out of the box
Build, prototype, and deploy locally without cloud constraints.
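Quick sanity check on the 200B-in-128GB claim. The post doesn't say what precision it assumes, so the bit widths below are my assumption, and this counts weights only (no KV cache, activations, or OS overhead). The helper name is made up for illustration.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"200B at {bits}-bit: {weight_memory_gb(200, bits):.0f} GB")
# 200B at 16-bit: 400 GB
# 200B at 8-bit: 200 GB
# 200B at 4-bit: 100 GB
```

So a 200B model only fits in 128GB of unified memory at roughly 4-bit quantization, and even then there isn't much headroom left for context.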