Beyond the Leaderboard #4: Can a 9.6GB Local Model Outcode a 400B Cloud Titan?
Four days ago, I started a benchmark series with a simple question: what happens when you measure AI models by what they actually do, not by their benchmark-suite scores?
- Day 1: Kimi K2.6 (0.66 overall). Fast, decent, nothing special.
- Day 2: DeepSeek-V4-Pro (0.72 overall). Slow but precise. A specialist.
- Day 3: MiniMax-M3 (0.80 overall). The surprise leader. Fast, balanced, but hallucinates on recent knowledge.
Today, Day 4: Google DeepMind's Gemma4:e4b — a 9.6GB model running locally on Ollama.
Result: 0.78 overall. The second-highest score of the series, just behind MiniMax-M3's 0.80.
Wait, what?
🧵⬇️👇
Today we're shipping Nemotron 3 Ultra.
A 550B MoE frontier-intelligence open model built for long-running agents.
It delivers 5x faster inference and lowers the cost of complex agentic tasks by up to 30% versus other open frontier models.
🔥 MiniMax-M3 just got faster.
Thanks for all the excitement around MiniMax-M3 — the response has been far beyond our expectations.
Last night, we rolled out a major inference upgrade:
🛠️ Fixed an issue that could occasionally produce abnormal tokens
💾 Increased memory and improved cache efficiency
🚀 ~50% higher throughput, with most users now seeing 50–70 TPS
You should notice a much smoother experience today.
More optimizations are on the way. ❤️
@sakurayukiai if i go back to the very first article of the series
`rag is not enough`
and the more approaches like this that stop hope-dumping into the context, the better.
Your memory system should not be deciding what the agent sees. The agent should. New Article on the quiet reversal in agent memory: stop injecting context, start giving the agent tools. Live now.
@PenfieldLabs graph is a powerhouse, but as most systems have found, maintaining the graph at scale is either slow, hard, costly, or all three. Looking forward to seeing what the future brings in that space.
The piece also covers the two-step rhythm (search returns previews, a separate call fetches the full record, ~200K tokens saved per session), and the oh-my-kiro observation pattern. https://t.co/p9YxwVeg7U
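A rough sketch of that two-step rhythm, under loud assumptions: the in-memory `STORE`, the `Record` type, the `search` and `fetch` names, and the 40-character preview length are all made up for illustration and are not taken from the article. The point is only that step one returns cheap previews and step two pays for a full record when the agent asks for it.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: str
    title: str
    body: str


# Hypothetical in-memory store; a real system would sit on a database or index.
STORE = {
    "m1": Record("m1", "Deploy notes", "Full deploy runbook text ... " * 50),
    "m2": Record("m2", "Auth bug", "Long investigation of the auth bug ... " * 50),
}

PREVIEW_CHARS = 40  # assumed preview length, chosen only for the example


def search(query: str) -> list[dict]:
    """Step 1: return id, title, and a short preview for each match."""
    q = query.lower()
    hits = [
        r for r in STORE.values()
        if q in r.title.lower() or q in r.body.lower()
    ]
    return [
        {"id": r.id, "title": r.title, "preview": r.body[:PREVIEW_CHARS]}
        for r in hits
    ]


def fetch(record_id: str) -> str:
    """Step 2: return the full body for the one record the agent picked."""
    return STORE[record_id].body
```

The agent would call `search("auth")`, read the previews, and only then call `fetch("m2")`, so unneeded full records never enter the context.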
The highest-leverage refinement in the whole piece: make every tool response end with one line about what to call next. No documentation. The agent learns the API through use. The trace becomes self-documenting.
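One way the "one line about what to call next" idea could look, as a minimal sketch: `with_next_step`, `search_tool`, the `Next:` prefix, and the hard-coded result are hypothetical names and data invented for this example, not the article's implementation.

```python
def with_next_step(payload: str, hint: str) -> str:
    """Append a single trailing line telling the agent what to call next."""
    return f"{payload}\nNext: {hint}"


def search_tool(query: str) -> str:
    # Hypothetical results; a real tool would query a memory store here.
    results = ["m2: Auth bug"] if "auth" in query.lower() else []
    if results:
        body = "\n".join(results)
        hint = "call fetch(id) for the full record"
    else:
        body = "No matches."
        hint = "call search(query) with broader terms"
    return with_next_step(body, hint)
```

Because the hint lives in the response itself, every step in a saved trace shows which call the tool suggested next.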
@garrytan I reviewed 19 memory systems and I'm pretty sure each of you didn't pick the best for you, but take a look and see if you nailed it https://t.co/OrYNbZuStF
@AzFlin I'd go for a middle ground: a trusted, curated set of skills, community-managed with safety gates (looking at you, npm), and then harnesses that can acquire skills on demand if they don't already have something better
Matrix style, learn to kungfu
RAG is where most agentic memory systems began. Some layered on it, some added alongside it, others ripped it out entirely, but one thing is for certain:
Vector RAG is not enough
https://t.co/05zbl7lwXm