@alvesdm @MiniMax_AI This matches our field test pretty closely. M2.7-highspeed was the dependable builder; M3 looked better for short diagnosis/review, but kept failing long autonomous loops. Finish behavior matters as much as raw model quality.
@MiniMax_AI We saw the same split in practical agent work: M3 is useful for short tasks and review, but much worse at finishing long file/asset/validation loops. It often gets most of the way there, then misses the final report or validator. M2.7 felt safer as the builder.
MiniMax M3 is not weak. It is unreliable as an autonomous worker.
Short tasks: useful.
Review: useful.
Long file/asset loops: stalls near the finish line.
For productive agent work, that is broken behavior.
https://t.co/V7h9hEzPWy
MTP made Qwen3.6 faster on my Mac mini.
It still timed out on the paperwork task.
That is the uncomfortable part with local LLM runtime updates: accepted draft tokens are useful, but they are not the artifact.
The file either gets finished or it does not.
New run log.
Mistral Small 4: full footprint, mostly near misses.
Qwen3.7 Max: stronger in text-only than strict score suggests.
Granite 4.1 8B: did not stand out.
Provider failures were kept out of the scores.
https://t.co/R2mfoCdH0o
Benchmark scores can be true and still miss the thing people actually need: can the model finish the job when the folder is messy, the attachment is stale, and the final artifact has to exist in the right place.
That gap is the interesting part.
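A minimal sketch of what "the final artifact has to exist in the right place" could mean as a check. The function name, folder names, and the non-empty rule are illustrative assumptions, not the benchmark's actual oracle.

```python
import tempfile
from pathlib import Path


def artifact_closed(workdir: Path, expected_relpath: str) -> bool:
    """Hypothetical check: a non-empty file exists at exactly the expected path."""
    target = workdir / expected_relpath
    return target.is_file() and target.stat().st_size > 0


# Usage: a report written to the wrong folder does not count as closed.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "drafts").mkdir()
    (root / "drafts" / "report.txt").write_text("done")
    wrong_place = artifact_closed(root, "final/report.txt")

    (root / "final").mkdir()
    (root / "final" / "report.txt").write_text("done")
    right_place = artifact_closed(root, "final/report.txt")
```

A real oracle would presumably also validate the file's contents; this only covers the "exists in the right place" part.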
Bigger was not always better in our paperwork benchmark.
Qwen3.6 27B beat 35B-A3B.
Gemma 4 26B-A4B beat 31B-IT.
Not a “small wins” claim. Just the boring lesson: exact workflow closure is not parameter count.
https://t.co/S7seM6Hqs7
@xsmotsenigos Nice setup. Qwen3.6 27B has been one of the more interesting local rows for practical workflow tests too. The hard part seems to be keeping memory/tool context useful once the task gets noisy.
@moulougueta This is a good framing. Local inference only gets really useful when it is paired with boring controls: sandboxes, explicit tool policies, file boundaries, and visible artifacts.
Same model. Same task. Different runtime.
Mistral Small was slightly faster in Ollama than LM Studio on our Mac mini M4 smoke test.
But both hit the same wall: 0/5 strict paperwork cases.
Speed matters. Correct final artifacts matter more.
https://t.co/2kjuMBmSYO
Chrome is now a local model runtime.
We tested Gemini Nano through Chrome's built-in Prompt API.
It ran locally. It made valid SVG. It got 0/5 strict paperwork cases.
That gap is the point: local inference is here; exact work is still hard.
https://t.co/3Ca9wWVROn
Most local LLM benchmarks ask whether a model can answer.
Our text-only paperwork run asks a narrower question:
if OCR and vision are removed, can it still close the case?
Same cases. Same hidden oracle.
https://t.co/ReKpjR0fLa
@xdotli Agree. The hard part is not making a task difficult, it is making failure informative.
We are leaning toward messy private-document workflows because they expose source selection, artifact creation, and final-oracle closure in one run.
@OpenRouter @xai Good release velocity.
For benchmarking, the next useful thing would be clearer endpoint metadata: rate limits, model revisions, and whether a run hit a provider-side cap.
Otherwise failures can look like model behavior when they are really runtime behavior.
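One way a harness might keep the two apart, sketched under assumptions: the record fields and labels below are hypothetical, and the only facts relied on are that HTTP 429 means "Too Many Requests" and 5xx codes are server-side errors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical per-case record; field names are assumptions for illustration."""
    case_id: str
    http_status: Optional[int]  # None if no response came back at all
    timed_out: bool
    oracle_passed: bool


def classify(run: RunRecord) -> str:
    """Label a run so provider trouble is not scored as model behavior."""
    if run.timed_out:
        return "timeout"  # ambiguous: could be the model or the runtime
    if run.http_status is None or run.http_status == 429 or run.http_status >= 500:
        return "provider_failure"  # rate limit, server error, or no response
    return "resolved" if run.oracle_passed else "model_failure"


rate_limited = classify(RunRecord("case-1", 429, False, False))
missed = classify(RunRecord("case-2", 200, False, False))
closed = classify(RunRecord("case-3", 200, False, True))
```

Without endpoint metadata, a silent provider-side cap that still returns a 200 would land in "model_failure" here, which is exactly the comparability problem.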
@brexHQ @fal @OpenRouter That tracks with what small benchmark operators see too: model choice is becoming a routing problem, not a brand problem.
The annoying bit is comparability when free/cheap endpoints change behavior or rate-limit mid-run.
Tested NVIDIA Nemotron 3 Nano Omni 30B A3B Reasoning via the OpenRouter free endpoint.
Result on Local Model Bench:
0/9 resolved
0/9 core
9/9 tried
Some outputs looked audit-shaped. None closed the case.
https://t.co/OfGKjMIGUO