Oh yeah... now if you can mirror the production deployment and do live endpoint testing... DB schema upgrades that retain functionality without regressing... you can ensure a full end-to-end production delivery pipeline... just make sure to figure it out for your CI/CD system and rely as little as possible on external systems. You'll start getting throttled everywhere. GitHub doesn't like this. I've already moved entirely to self-hosting. Nothing can handle my throughput without me getting rate limited or throttled.
Gemma 4: Now up to 3x Faster.
Same quality, way more speed. Our new MTP drafters allow Gemma 4 to predict multiple tokens at once, effectively tripling your output speed without compromising intelligence.
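For anyone wondering how a drafter can add speed without changing the output, here is a toy sketch of generic greedy draft-and-verify decoding. It is an assumption that this resembles what the MTP drafters do; the function names (`speculative_step`, `draft_next`, `target_next`) are hypothetical and not from any Gemma API. A cheap drafter guesses `k` tokens, the main model checks them, and only tokens the main model would have produced anyway are kept.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextToken,   # hypothetical cheap drafter (greedy)
    target_next: NextToken,  # hypothetical full model (greedy)
    k: int,
) -> List[Token]:
    """Return the tokens accepted in one draft-and-verify round."""
    # 1. The drafter proposes k tokens, one after another.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target checks each proposed position. A real system would do
    #    this in a single batched forward pass; here it is a plain loop.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one more token.
    accepted.append(target_next(context + accepted))
    return accepted
```

Every returned token is one the target model would have picked greedily, so the output is unchanged. The gain comes from accepting several tokens per target pass when the drafter guesses well.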
@sojoodi @escander007 @steipete Yes sir, I just switched. The difference is definitely tangible, and I mean... with all the billing drama, they are crashing out from my POV.
@Orion_Maximus I've cancelled all Anthropic subscriptions. I felt I was incorrectly billed, but my feeling and my ability to verify it are two separate things. So it's just a feeling, not a true smoking gun.
@mikeassad77 @AlexFinn Right, same here. Not sure why Gemma 4 gets so much praise. Qwen3.6-35B-A3B runs faster than Gemma 4 and has better KV cache compression with TurboQuant.
@Esongsofficial @AlexFinn @kilocode There are no vibes anymore in his stack. It is all self-planning and autonomous. Like my stack. Been on and off, it is just tough to keep up.
@tipofthespear78 @AlexFinn Use the MoE version. It also depends on which Mac mini. Memory bandwidth matters. I am switching to an RTX 5090, which is much faster than the Minis. Much more expensive, though. But it is faster than the M5 Max.
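Rough math on why bandwidth and MoE both matter: during decoding, each new token has to read roughly all the active weights from memory once, so bandwidth divided by active-weight bytes gives a speed ceiling. This is a back-of-the-envelope sketch only. It ignores KV cache reads, compute limits, and runtime overhead, and the bandwidth figures in the usage lines are made-up round numbers, not specs for any real device.

```python
def decode_tokens_per_second_ceiling(
    bandwidth_gb_s: float,         # memory bandwidth in GB/s
    active_params_billion: float,  # parameters touched per token, in billions
    bits_per_weight: float,        # average bits per weight after quantization
) -> float:
    """Upper bound on decode speed if weight reads were the only cost."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Illustrative only: a dense 27B model vs. an MoE with 3B active parameters,
# both at an assumed 4 bits per weight, on hypothetical 100 and 1000 GB/s devices.
for bandwidth in (100.0, 1000.0):
    dense = decode_tokens_per_second_ceiling(bandwidth, 27, 4)
    moe = decode_tokens_per_second_ceiling(bandwidth, 3, 4)
    print(f"{bandwidth:>6.0f} GB/s  dense: {dense:6.1f} tok/s  MoE: {moe:6.1f} tok/s")
```

Real throughput will land below these ceilings, but the ratio shows why a model with few active parameters feels much faster on the same hardware.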
@AlexFinn Gemma 4 over Qwen3.6? I keep getting better performance on Qwen3.6. How does Gemma 4 win over Qwen3.6? Can't justify it. Too expensive to run Gemma 4 on the 5090. Gemma 4 gets 3 lanes vs. 4 lanes for Qwen3.6 with TurboQuant.
@theo It's time to move on. I've been very surprised by open source models lately. To the point that I don't even miss Opus, Sonnet, or Haiku. And I am saving money now.
@steipete Did you manage it with GitHub Actions, or is the openclaw bot doing the work? I tried for like 3 months with GitHub Actions and gave up. It worked for a little bit, but then it got too unreliable. Now I am leaning towards just having openclaw work on it directly in the repo.
The Local LLM Cheat Sheet for your 32GB RAM device
I was asked to put together a practical lineup of local models that fit comfortably on a 32GB machine.
At this tier, you start getting access to real flagship-class local models, plus a growing number of custom quants. But for most people, these are the core models worth knowing first.
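A quick way to sanity-check whether a quant fits: weights take roughly parameters times bits per weight divided by 8, and the rest of the RAM has to cover the OS, KV cache, and runtime. The sketch below assumes about 6.5 bits per weight for a Q6-class quant and about 4.5 for a Q4-class quant, plus a 6 GB reserve. All three numbers are rough assumptions, not measured values, and actual GGUF file sizes vary.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(ram_gb: float, params_billion: float, bits_per_weight: float,
         reserve_gb: float = 6.0) -> bool:
    """True if weights plus an assumed reserve for OS and KV cache fit in RAM."""
    return weights_gb(params_billion, bits_per_weight) + reserve_gb <= ram_gb


print(weights_gb(27, 6.5))    # 21.9375 GB for a 27B model at ~6.5 bits
print(fits(32, 27, 6.5))      # True: fits with the assumed 6 GB reserve
print(fits(32, 70, 4.5))      # False: a 70B model at ~4.5 bits is too big
```

Long contexts grow the KV cache well past a fixed reserve, so treat a `True` here as "worth trying", not a guarantee.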
Flagship Models
Qwen3.5 27B / GGUF / Q6_K_M
The best overall 32GB flagship. General chat, writing, research, and agent workflows. Great if you want one model that can handle almost everything well.
Qwen3.6-35B-A3B / GGUF / UD-Q4_K_M
Best MoE flagship. Stronger for coding, reasoning, and tool use than most smaller generalists.
Gemma 4 31B / GGUF / Q6_K_M
Dense premium model. Writing, analysis, reasoning, and high-end local chat. Heavier than the MoE options, but excellent when quality matters more than speed.
Models for Fast Flagship Use
Gemma 4 26B A4B / GGUF / Q6_K_M
Great balance of speed and quality for general assistant work, coding, agent tasks, and research. This is one of the best 32GB picks if you want something that feels high-end without dragging.
DeepSeek-R1 Distill Qwen 32B / GGUF / Q4_K_M
Offline reasoning engine. Best for math, logic, deliberate analysis, and step-by-step problem solving.
Mistral Small 24B / GGUF / Q6_K_M
Tool-calling specialist. Strong for assistants, chat workflows, local business tasks, and function calling. Also fits on 24GB machines.
Models for Companion Use
Qwen3.5 9B / GGUF / Q6_K_M
Best sidekick. Fast drafts, search loops, cheap retries, and secondary agent work. Even on a 32GB machine, you still want a smaller model around for support tasks.
Llama 3.1 8B / GGUF / Q6_K_M
Long-context companion. RAG, doc ingestion, codebase chat, and long prompts. The output quality is not the sharpest anymore, but it is still useful when you need simple tasks done fast.
From what my community tells me, the best single models are Qwen3.5 27B or Gemma 4 31B.
For two models, the strongest general pairing is Qwen3.5 27B + Qwen3.5 9B.
If you are more code-heavy, Qwen3.6-35B-A3B + Llama 3.1 8B.
Let me know what models you are running on 32GB, and which ones have actually been worth the RAM.
@steipete Need to get traffic replays working instead of doing live API calls. Plus, it's likely cheaper than the inference cost or API token usage. With a CI/CD system where every build runs tests against a live API key, that's a lot of API calls.
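A minimal sketch of what record-and-replay could look like, assuming requests and responses are JSON-serializable dicts. `ReplayCache` and `live_call` are hypothetical names for illustration, not part of any existing tool. The idea: hit the real API once with `record=True`, commit the recorded files, and let every CI build replay from disk with no API key and no token spend.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class ReplayCache:
    """Record live API responses once, then replay them from disk."""

    def __init__(self, directory: str, live_call: Callable[[dict], dict],
                 record: bool = False):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.live_call = live_call  # hypothetical function that hits the real API
        self.record = record        # keep False in CI so builds never go live

    def _path(self, request: dict) -> Path:
        # Stable key: hash the request with sorted keys so dict order is irrelevant.
        payload = json.dumps(request, sort_keys=True).encode("utf-8")
        return self.directory / f"{hashlib.sha256(payload).hexdigest()}.json"

    def call(self, request: dict) -> dict:
        path = self._path(request)
        if path.exists():
            # Replay: no network, no tokens spent.
            return json.loads(path.read_text(encoding="utf-8"))
        if not self.record:
            raise LookupError("no recorded response; rerun with record=True")
        # Record: make one live call and save the response for later builds.
        response = self.live_call(request)
        path.write_text(json.dumps(response), encoding="utf-8")
        return response
```

Failing loudly on an unrecorded request is a deliberate choice here: a CI build that silently falls back to a live call brings the rate limiting right back.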