New Google Gemma 4 12B claims near-26B performance - we tested both!
We ran both models locally on one RTX 4090 and gave each the same task: write a self-contained HTML5 canvas animation with real physics in one file without libraries. Three scenes - a Galton board, two blocks colliding off a wall, and a chaotic triple pendulum
Outputs:
Gemma 4 26B-A4B: 15 GB VRAM usage, 6.9k tokens, 138 tok/s
Gemma 4 12B: 9 GB VRAM usage, 8.9k tokens, 80 tok/s
Same Gemma 4 family, but the 26B-A4B won every scene and ran ~1.7x faster - on just 4B active params. The 12B stayed very close though, on 40% less VRAM (9 GB vs 15 GB) - which makes it the ideal model for a 16 GB laptop
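For anyone checking the math, here is a quick sketch that uses only the numbers reported above. The generation-time estimate assumes the token counts are total generated tokens and that tok/s is a steady decode rate, which the post doesn't confirm; all names are illustrative.

```python
# Figures copied from the post; names are illustrative.
runs = {
    "Gemma 4 26B-A4B": {"vram_gb": 15, "tokens": 6900, "tok_per_s": 138},
    "Gemma 4 12B": {"vram_gb": 9, "tokens": 8900, "tok_per_s": 80},
}

big = runs["Gemma 4 26B-A4B"]
small = runs["Gemma 4 12B"]

# How much faster the 26B-A4B generated tokens.
speedup = big["tok_per_s"] / small["tok_per_s"]

# Share of the 26B-A4B's VRAM that the 12B needed.
vram_ratio = small["vram_gb"] / big["vram_gb"]


def est_seconds(run):
    # Assumption: every token was generated at the reported rate.
    return run["tokens"] / run["tok_per_s"]


print(f"speedup: {speedup:.2f}x")
print(f"12B VRAM vs 26B-A4B: {vram_ratio:.0%}")
for name, run in runs.items():
    print(f"{name}: ~{est_seconds(run):.0f} s to generate")
```

Under those assumptions the 12B wrote more tokens at a lower rate, so its estimated generation time is more than double the 26B-A4B's, a bigger gap than the tok/s figure alone suggests.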
Nemotron 3 Ultra performed at GPT 5.5 level, 10× cheaper
We gave both models the same three prompts to build HTML5 canvas animations with real physics. The first scene is water in a spinning drum. The second is a Galton board - balls falling through pegs into bins. The third is a block collision setup with extreme mass differences.
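The post doesn't show the prompts, so the following is only a sketch of the kind of physics a block scene like this involves, assuming frictionless blocks, a fixed wall, and perfectly elastic collisions. The function names are made up for illustration, and this is not either model's output.

```python
def elastic_collision(m1, v1, m2, v2):
    """Velocities after a 1D perfectly elastic collision."""
    total = m1 + m2
    new_v1 = ((m1 - m2) * v1 + 2 * m2 * v2) / total
    new_v2 = ((m2 - m1) * v2 + 2 * m1 * v1) / total
    return new_v1, new_v2


def count_collisions(mass_ratio):
    """Count hits when a heavy block slides into a light one next to a wall.

    The wall is on the left; positive velocity points away from it.
    Only velocities are tracked, because in this setup block-block and
    block-wall hits can never be pending at the same time.
    """
    m_small, m_big = 1.0, float(mass_ratio)
    v_small, v_big = 0.0, -1.0  # heavy block starts moving toward the wall
    hits = 0
    while True:
        if v_big < v_small:  # blocks are closing in on each other
            v_small, v_big = elastic_collision(m_small, v_small, m_big, v_big)
        elif v_small < 0:  # light block is heading into the wall
            v_small = -v_small
        else:  # light block moving away from the wall, heavy one at least as fast
            break
        hits += 1
    return hits


for ratio in (1, 100):
    print(ratio, count_collisions(ratio))
```

With equal masses this counts 3 collisions, and with a 100:1 mass ratio it counts 31. The count climbs quickly as the ratio grows, which is what makes extreme mass differences a hard test for a simulation that steps time naively.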
Outputs:
Nemotron 3 Ultra: 11.3k tokens, $0.051
GPT 5.5: 11.0k tokens, $0.57
Nemotron stays right on GPT 5.5's heels while costing about 10× less. The gap in quality is far smaller than the gap in price.
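The price gap can be rechecked from the reported figures alone. This sketch assumes each dollar amount is the total bill for the run and each token count is that run's total output, so the per-1k figure is a blended rate and not an official price; names are illustrative.

```python
# Figures copied from the post; names are illustrative.
runs = {
    "Nemotron 3 Ultra": {"tokens": 11_300, "cost_usd": 0.051},
    "GPT 5.5": {"tokens": 11_000, "cost_usd": 0.57},
}


def usd_per_1k_tokens(run):
    # Blended rate: total reported cost spread over total reported tokens.
    return run["cost_usd"] / run["tokens"] * 1000


price_ratio = runs["GPT 5.5"]["cost_usd"] / runs["Nemotron 3 Ultra"]["cost_usd"]

print(f"price ratio: {price_ratio:.1f}x")
for name, run in runs.items():
    print(f"{name}: ${usd_per_1k_tokens(run):.4f} per 1k tokens")
```

The ratio comes out slightly above 11x, so the 10× in the headline is a conservative rounding.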
I made @OpenAI GPT-5.5 and @GoogleDeepMind Gemini 3.1 Pro play REAL UNO — 10 games, full rules.
Gemini overthought every single card — literally 6× more reasoning tokens (133.5K vs 22.5K).
And it paid off — Gemini won 8–2.
Best part: they roast each other between cards the whole match 😂
Full stats ↓
@Ariadnavozz @atomic_chat_hq for tool/mcp calls gemma's genuinely decent, the 26B-A4B especially. basic coding it handles fine. frontier-level coding it's not - that's not what these sizes are for
@10xerik @atomic_chat_hq close on the easy scenes. the splits show on the hard physics - 26B led on speed and the pendulum, the 12B handled the block collision better. not a clean sweep either way
@prodbitz @atomic_chat_hq @atomicbot_ai ye thats the read - 4B active is why it ran ~1.7x faster than the dense 12B and still took the scenes. MoE doing its thing
@Rom609033637850 @atomic_chat_hq we didn't run qwen here, it was gemma vs gemma. qwen comparison is the most requested though, doing it next. which qwen would you put up against it?
@AlanAiEngineer @atomic_chat_hq qwen's likely ahead on raw quality, fair. one note though - neither ran in 3 GB here, the 12B was ~9 GB and the 26B-A4B ~15 GB on a 4090. which gemma/quant did you try at 3?
@NikiBelokopytov @atomic_chat_hq could well be. memory-matched is the fair follow-up, we'll cut the 26B to the same budget and rerun. if the hobbled one still wins that's a real result
StepFun Step 3.7 Flash smashed DeepSeek V4-Flash in a physics contest
We gave two open-weight models the same task: write a self-contained HTML5 canvas animation with real physics in one file without libraries. Three scenes - a Galton board, balls bouncing in a spinning hexagon and five metronomes that sync up
Outputs:
Step 3.7 Flash: 59.6k tokens, 9m 57s
DeepSeek V4-Flash: 52.5k tokens, 6m 21s
DeepSeek was faster, but that's all it had. StepFun's model won on every front: physics simulation, visuals, and logical rendering of each scene
I ran a 10-game AI chess tournament: @claudeai Opus 4.8 vs @MiniMax_AI M3
API calls powered by @aimlapi
In this game, Claude crushes MiniMax in 28 moves
Claude costs ~8x more per move: ~$0.020 vs ~$0.0025
Total: ~$0.62
Full results below ↓