1. Not useful for serious coding projects
2. Extremely useful for dull agentic workflows, even overpowered. It is really, really good at eye-popping frontend, basic Python scripting that calls REST APIs, web/research, traditional ML model training and, more importantly, working properly in a harness: that sort of low-hanging-fruit stuff. (Low-hanging fruit in agentic AI terms still needs Sonnet 4. level intelligence to be dependable.)
Models covered under the order will be provided to the Federal Government and agencies for a thirty day window before early access begins for select trusted partners, who are yet to be determined.
🚀 MiniMax M3: Aiming for the Stars
Zhihu contributor toyama nao shares an early evaluation of MiniMax's new M3 multimodal model.
🔮 TL;DR
Back in April, GLM-5.1 pulled decisively ahead of MiniMax M2.7 and took the domestic coding crown. Two months later, MiniMax responds with M3.
The upgrade is significant: stronger reasoning, better stability, and much improved coding ability. M3 has firmly entered the "usable" tier.
⚖️ The cost? Efficiency.
Token consumption is up 77% versus M2.7—the highest among major models tested. Many medium-complexity tasks now consume 60K–70K tokens, making M3 substantially more expensive in practice.
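A rough back-of-the-envelope check on those numbers, as a minimal sketch. It assumes the 77% average increase applies evenly to the 60K–70K medium-complexity tasks, which the post does not state; `implied_baseline` is a hypothetical helper name.

```python
# Assumption: the 77% average increase applies evenly to these tasks.
INCREASE = 0.77


def implied_baseline(m3_tokens: float, increase: float = INCREASE) -> float:
    """Tokens M2.7 would have used if M3 needs `m3_tokens` after the increase."""
    return m3_tokens / (1 + increase)


for m3_tokens in (60_000, 70_000):
    baseline = implied_baseline(m3_tokens)
    print(f"M3 {m3_tokens:,} tokens -> implied M2.7 baseline ~{baseline:,.0f} tokens")
```

Under that assumption, the same tasks would have cost M2.7 somewhere in the mid-to-high 30K range.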
🧠 Logic & Reasoning
Compared with DeepSeek V4 Flash, M3's strengths and weaknesses are clear.
✅ Strength 1: Long-context understanding
M3 shows excellent long-context hallucination control, reliably retrieving information from deep inside large documents.
On difficult retrieval-heavy tasks, it performs similarly to Qwen3.7-Max and ranks among the strongest domestic models.
✅ Strength 2: Complex reasoning
Long-chain reasoning is a major improvement over M2.7.
M3 has entered the top tier of Chinese models, solving problems through careful step-by-step exploration rather than relying on sudden insights.
⚠️ Weakness 1: Instruction following
M3 handles short and clear prompts extremely well, but performance becomes less predictable with long instructions and extended contexts.
As conversations grow longer, the model can suddenly lose track of earlier requirements.
⚠️ Weakness 2: Reasoning efficiency
M3 often consumes more tokens than DeepSeek V4 Flash on comparable tasks.
Even medium-difficulty problems regularly exceed 30K tokens, with reasoning traces filled with repeated self-checks and verbose intermediate steps.
💻 Coding Performance
M3 is a substantial leap over M2.7, especially in frontend development and software engineering workflows.
Its coding behavior is highly structured: planning first, implementing module by module, testing continuously, and validating before delivery.
✅ Strength 1: Better architecture design
M3 is much stronger at choosing practical architectures that fit project requirements without overengineering.
✅ Strength 2: Strong self-testing
A large portion of M3's coding process is dedicated to self-debugging and validation. For complex issues, it can often locate problems efficiently on its own.
⚠️ Weakness 1: Expensive development cycles
Self-testing is also costly. A single task may require dozens of debugging rounds, with testing consuming more tokens than coding itself.
⚠️ Weakness 2: Requirement drift
As context grows, M3 can gradually forget parts of the original specification. The final output may pass tests while still missing requested functionality.
For best results, changes should remain relatively small and manageable.
Overall, M3's coding ability has crossed the usability threshold and clearly outperforms M2.7, though it remains behind Opus in efficiency, detail control, and overall engineering quality.
🧭 Final Thoughts
There is no magic in the LLM race.
M3 arrived only two months after M2.7, yet the progress is substantial. If M2.7 concentrated heavily on Agent capabilities, M3 appears to rebalance toward broader general intelligence.
The key tradeoff is clear:
👉 Prioritize delivery quality first.
👉 Optimize efficiency later.
That choice helped M3 close much of the capability gap—but its soaring token consumption may be the next challenge MiniMax has to solve.
📖 Full article:
https://t.co/OlHDJgwJ71
#MiniMax #MiniMaxM3 #AI #LLM #AICoding #Agent #MultimodalAI
It’s been a heavy day of using MiniMax M3 in a very complex Hermes Agent fork deployed in a 20,000-member Discord group.
1. I feel the observation below is spot on.
2. But it is also good enough for Python, frontend, web search, market calls, and handling multi-modality. That basically covers 80% of what a trading-related server would require.
3. What surprises me is that the model tries to work a recursive self-improvement concept into anything it touches in the harness.
4. Handles terminal tasks like Claude, but has Codex-like traits (early 5.x series autism).
5. As for what most people are saying, “the model is not lazy”: not true in my case.
Signal is messy. Will test more.
Also, it’s too slow!
Minimax M3 results are now live on GBENCH:
It's a solid model, but the other Chinese labs with April releases had slightly better models.
The main thing to worry about is benchmaxxing -- their model card was NOT accurate.
Our evaluations are designed to resist this kind of overfitting.
Not enough political motivation yet. But desire is popping up in the discourse everywhere (e.g. Sanders’ recent take). Discourse should lead to the formation of polarising camps, which in turn leads to political motivation.
do you think we see a large nationalized US government agi project in the next five years? or a chinese version, which probably forces our consolidation as well? why/why not?
the permanent underclass is already here, but you just haven't noticed it
here's what it looks like:
- frontier labs keep their best models to themselves for 1-3 months to make sure they're safe
- then they sell the tokens to the US government and trillion dollar companies
- after that allied countries get access
- and only then do the poors get access to it after half a year of waiting. meanwhile everyone ahead of them is already on Mythos 2, which is exponentially better
minimax m3 is wild: it broke the one rule every ai model has followed, which is higher cost = better capability...
if you put every model on a graph, price on one side, how good it is on the other.. they all fall along a straight line..
cheap / weaker models sit bottom left and expensive / stronger ones sit top right.. you pay more, you get more simple as that..
picture a diagonal from cheap and weak (bottom left) to expensive and strong (top right)..
that line is the going rate of how much capability your money buys... every model pays it... m3 is the first to get more than it paid for, landing above the line where nothing has ever been..
it's as capable as the mid tier frontier models, but priced like the cheapest ones, at $1.20..
and the bigger part is that m3 is open weight so for the first time, the best value on the chart is also the one you fully own..
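The "above the line" idea can be sketched as a least-squares fit plus a residual check. Everything below is illustrative: the model names and (price, score) pairs are invented placeholders, the M3 score of 70 is made up, and only the $1.20 figure comes from the post (its unit is not given there).

```python
# Hypothetical (price, capability score) pairs, invented purely for illustration.
models = {
    "model_a": (0.5, 30.0),
    "model_b": (2.0, 45.0),
    "model_c": (5.0, 75.0),
    "model_d": (10.0, 125.0),
}


def fit_line(points):
    """Ordinary least-squares fit of score = slope * price + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def value_above_line(price, score, slope, intercept):
    """Positive result: more capability than the going rate predicts at that price."""
    return score - (slope * price + intercept)


slope, intercept = fit_line(list(models.values()))
# $1.20 is the price quoted in the post; the score of 70 is a placeholder.
print(value_above_line(1.20, 70.0, slope, intercept))
```

A positive residual is what "landing above the line" means here; whether M3 really does depends on which benchmark and which price metric go on the axes.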
I love Pi agent, so much so I made a full 1:1 Python port and called it Harn (from Old Norse for "brain" and also to reference "harness") https://t.co/LJzuO8R5lp
MiniMax M3 just dropped — their first natively multimodal model.
So I ran it through my form-filling test. (The model has to place each element at the right pixel position on a blank form image, not type into a field.)
Verdict: it got everything on the paper.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code, all there.
> Best character spacing I've seen yet: it actually calculates the gap between each character, clean across the DOB and number boxes
> A few fields slightly misaligned, but every piece of data made it onto the form
The reasoning chain is the interesting part: it does the easy fields first, then works into the tight one-char-per-box fields, reasoning through y-coordinates, baselines, and label clearance in obsessive detail.
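For a sense of the geometry involved in one-char-per-box fields, here is a minimal sketch. It is not the model's actual procedure; the function names, parameters, and pixel values are all hypothetical, and it assumes evenly spaced boxes of equal width with y measured from the top of the image.

```python
def char_centers(first_box_left: float, box_width: float, gap: float, text: str):
    """Return (char, x_center) pairs, placing one character per box."""
    pitch = box_width + gap  # distance between the left edges of adjacent boxes
    return [
        (ch, first_box_left + i * pitch + box_width / 2)
        for i, ch in enumerate(text)
    ]


def baseline_y(box_top: float, box_height: float, bottom_padding: float) -> float:
    """y of the text baseline, measured downward from the top of the image."""
    return box_top + box_height - bottom_padding


# Hypothetical layout: boxes start at x=100, 20 px wide, 4 px apart.
print(char_centers(100, 20, 4, "1990"))
print(baseline_y(50, 30, 6))
```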
The cost: 40:33 and 126.7k output tokens. That's a long think — but it's MiniMax's first multimodal model, and it nailed the content.
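Those two figures imply an average output rate, worked out below. This assumes 40:33 means minutes:seconds and that the 126.7k tokens were generated over that whole span; `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(output_tokens: float, minutes: int, seconds: int) -> float:
    """Average output rate over a run lasting minutes:seconds."""
    elapsed_seconds = minutes * 60 + seconds
    return output_tokens / elapsed_seconds


rate = tokens_per_second(126_700, 40, 33)
print(f"{rate:.1f} output tokens/s")  # roughly 52 tokens per second
```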