๐ฅ๏ธ Best Local LLMs for Consumer GPUs โ llama.cpp Guide (June 2026)
What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner โ no Docker, no Python env, no cloud.
โโโ 8-16GB VRAM โโโ
๐น Gemma 4-12B (Google)
โข Smartest model in this size class โ competes with stuff 2ร bigger
โข Unsloth's MTP GGUFs: 162 tok/s vs 52 tok/s normal (3ร speedup)
โข Minimum 8GB VRAM recommended for Q4_K_M quant
โข GGUF โ https://t.co/VWp818MB3D
๐น LFM2.5-8B-A1B (LiquidAI)
โข Hybrid MoE, only 1B active params โ absurdly fast for its size
โข Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget
โข GGUF โ https://t.co/ZbOs4mXJDq
โโโ 16-32GB VRAM โโโ
๐น Qwen3.6-27B (Qwen)
โข Scored 1.00 on tool-efficiency benchmarks โ best local agent available
โข 40 deterministic tasks, 32k/128k context needle tests โ all passed
โข GGUF โ https://t.co/n7K3sPvliE
โข MTP version (faster) โ https://t.co/gwdfnJTzcy
๐น Qwopus3.6-27B-v2 (Jackrong)
โข Best quantization of Qwen3.6-27B โ topped 5 agent & coding benchmarks (1200 samples)
โข If you're running Q4, this is the one to grab
โข GGUF โ https://t.co/tV1DFqXnOD
โข MTP version โ https://t.co/PMqz7V5ewv
๐น Gemma 4-31B QAT (Google/Unsloth)
โข QAT variant with MTP draft head: 76-125 tok/s (1.67ร speedup)
โข Excellent for multi-agent / subagent workflows
โข GGUF โ https://t.co/FgVsUX0YOB
๐น Nex-N2-Mini (Nex AGI)
โข Post-train of Qwen3.5-35B-A3B โ MoE with only 3B active params
โข Fits on 16GB+ VRAM, overflow loads from system RAM
โข Adaptive thinking saves ~20% tokens with no quality loss
โข For deep multi-step reasoning, nothing in this size comes close
โข GGUF โ https://t.co/oyC522a8Eh
โโโ Quick Picks โโโ
โข 16GB all-rounder โ Gemma 4-12B with MTP GGUFs
โข 32GB all-rounder โ Qwen3.6-27B / Qwopus-v2
โข Agents & tool use โ Qwen3.6-27B or Qwopus Q4
โข Deep reasoning โ Nex-N2-Mini (MoE, fits 16GB+)
โข Tight budget โ LFM2.5-8B-A1B
โข Cheapest full build: 1ร used RTX 3090 (24GB) + rest of PC โ $1000-1500
โโโ Setup on Windows โโโ
1. Download llama.cpp โ https://t.co/et0J7Swua7 (latest .zip)
2. Extract to any folder (e.g. C:\llama.cpp)
3. Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance)
4. Run one of the commands below depending on your hardware
โโโ Launch Commands โโโ
SINGLE GPU โ Standard model (no MTP):
llama-server.exe ^
-m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
-ngl 100 ^
-np 1 ^
--port 8080 ^
--jinja
SINGLE GPU โ MTP model (faster inference):
llama-server.exe ^
-m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
--spec-type draft-mtp ^
--spec-draft-n-max 3 ^
-ngl 100 ^
-np 1 ^
--port 8080 ^
--jinja
DUAL GPU โ Split across two cards:
llama-server.exe ^
-m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
-ngl 100 ^
--tensor-split 0.55,0.45 ^
--main-gpu 0 ^
-np 1 ^
--port 8080 ^
--jinja
DUAL GPU + MTP + Vision (multimodal):
llama-server.exe ^
-m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
--spec-type draft-mtp ^
--spec-draft-n-max 3 ^
-ngl 100 ^
--tensor-split 0.60,0.40 ^
--main-gpu 0 ^
-np 1 ^
--port 8080 ^
--jinja ^
--mmproj C:\models\mmproj-F16.gguf
โโโ Parameter Breakdown โโโ
-m <path>
Path to your .gguf model file. Change this to wherever you downloaded it.
--ctx-size 180000
Context window in tokens. 180k = huge context for long conversations or big codebases.
Reduce to 32768 or 65536 if you don't need long context โ uses less VRAM.
--flash-attn on
Flash Attention โ dramatically speeds up inference and reduces VRAM usage.
Works on RTX 30xx/40xx/50xx. Always enable this.
--cache-type-k q4_0 / --cache-type-v q4_0
Quantizes the KV cache (key/value attention cache) to 4-bit.
This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory.
Quality impact is minimal โ this is a free performance win.
--batch-size 1024 / --ubatch-size 512
batch-size = how many tokens are processed in one forward pass (throughput).
ubatch-size = micro-batch actually sent to the GPU per step.
Higher = faster prompt processing but needs more VRAM.
If you run out of VRAM, lower these (e.g. 512/256).
-ngl 100
Number of layers to offload to GPU. 100 = all layers on GPU (full offload).
This is what you want if the model fits in your VRAM.
If it doesn't fit, reduce this (e.g. -ngl 40) โ remaining layers run on CPU/RAM.
--tensor-split 0.55,0.45
How to split model layers across multiple GPUs. Values are ratios.
0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%.
Adjust based on your VRAM โ give more to the card with more memory.
Example: 0.70,0.30 for a 24GB + 12GB setup.
Not needed for single GPU setups.
--main-gpu 0
Which GPU handles the batch computation (the "orchestrator").
Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers.
Minor performance impact โ usually just leave it at 0.
-np 1
Number of parallel slots (concurrent requests). 1 = one user at a time.
Increase to 2-4 if you want multiple clients connected simultaneously.
Each extra slot uses additional VRAM for its own KV cache.
--port 8080
Which port the server listens on. Change if port 8080 is busy.
--jinja
Enables Jinja2 template processing โ required for proper chat formatting.
Most modern models expect this. Always include it.
--spec-type draft-mtp
Enables Multi-Token Prediction (MTP) speculative decoding.
Only works with MTP GGUF models (downloaded separately).
The model predicts multiple tokens at once and verifies them โ big speed boost.
--spec-draft-n-max 3
How many tokens the MTP draft head proposes per step.
3 is a good default. Higher = potentially faster but more VRAM and may reduce quality.
--mmproj <path>
Path to the multimodal projector file (for vision models).
Enables image understanding โ paste screenshots into the web chat.
Only needed if you want vision capabilities. Omit for text-only use.
โโโ Your Hardware โ Your Command โโโ
Single GPU (8-24GB VRAM):
Use the "Single GPU" command. Change -m to your model path.
8GB card โ Gemma 4-12B Q4 or LFM2.5-8B
12GB card โ Gemma 4-12B Q5/Q6
16GB card โ Gemma 4-31B QAT Q4 or Nex-N2-Mini
24GB card โ Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6
Dual GPU:
Use the "Dual GPU" command. Adjust --tensor-split based on your VRAM ratio.
24GB + 24GB โ --tensor-split 0.50,0.50
24GB + 12GB โ --tensor-split 0.70,0.30
24GB + 8GB โ --tensor-split 0.75,0.25
Want speed? Use MTP versions of models with the "MTP" commands.
Want vision? Add --mmproj with the projector file from the model's HuggingFace repo.
5. Once running, you get:
โข Web chat UI โ http://localhost:8080
โข OpenAI-compatible API โ http://localhost:8080/v1
โข Playground โ http://localhost:8080/playground
โโโ Why /v1 API Is the Killer Feature โโโ
One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible โ every tool that speaks OpenAI just works. No custom code, no glue layer.
Works out of the box with:
โข IDEs: Cursor, Continue, Windsurf, Cline, Roo Code
โข CLI tools: aider, Open Interpreter, OpenCode
โข Frameworks: LangChain, LlamaIndex, LiteLLM
โข Any OpenAI SDK (Python, Node, Go, Rust)
Why this beats cloud APIs:
โข 100% private โ code never leaves your machine
โข $0 per token โ no rate limits, no quotas, no surprise bills
โข Works fully offline
โข Zero telemetry, no training on your data
โข Swap models by dropping in a different .gguf โ no app changes needed
โข Run 32kโ128k context windows without burning money
Good combos:
โข Cursor + Qwopus-v2 โ near-frontier quality, zero API cost
โข Continue + Qwen3.6-27B โ best local coding agent
โข aider + Gemma 4-12B MTP โ 162 tok/s, feels instant
โข OpenCode + Nex-N2-Mini โ deep reasoning on 16GB
Set any OpenAI-compatible client to your local endpoint:
set OPENAI_API_KEY=sk-dummy (any non-empty string works)
set OPENAI_BASE_URL=http://localhost:8080/v1
# every OpenAI-compatible tool now hits your local GPU
Shoutouts: @0xSero@rS_alonewolf@witcheer@UnslothAI@LottoLabs
Claude Code fully dissected!
Researchers from UCL reverse-engineered the leaked Claude source. What they found changes how you should think about agent design.
Only 1.6% of the codebase is AI decision logic.
The other 98.4% is operational infrastructure. Permission gates, tool routing, context compaction, recovery logic, session persistence. The model reasons. The harness does everything else.
This is the opposite of what most agent frameworks do today.
LangGraph routes model outputs through explicit state machines. Devin bolts heavy planners onto operational scaffolding. Claude Code gives the model maximum decision latitude inside a rich deterministic harness, and invests all its engineering effort in that harness.
The core loop is a simple while-true. Call model, run tools, repeat.
But the systems around that loop are where the real design lives:
A permission system with 7 modes and an ML classifier. Users approve 93% of prompts anyway, so the architecture compensates with automated layers instead of adding more warnings.
A 5-layer context compaction pipeline. Each layer runs only when cheaper ones fail. Budget reduction, snip, microcompact, context collapse, auto-compact.
Four extension mechanisms ordered by context cost. Hooks (zero), skills (low), plugins (medium), MCP (high). Each answers a different integration problem.
Subagents return only summary text to the parent. Their full transcripts live in sidechain files. Agent teams still cost roughly 7x the tokens of a standard session.
Resume does not restore session-scoped permissions. Trust is re-established every session. That friction is the point.
The bet behind all of this is simple. As frontier models converge on raw coding ability, the quality of the harness becomes the differentiator, not the model.
Paper: Dive into Claude Code (arXiv:2604.14228)
We've shared an article on Agent Harness and what every big company is building.
Read it below.
people keep asking me, if you were starting over today with nothing, how would you do it.
here is the honest answer, the one nobody wants because it isn't sexy.
i would buy the cheapest used card that runs a real model. a 3060 12gb, 200 bucks, and i would not spend another cent until i had squeezed everything out of it. not because i couldn't dream bigger, but because the card was never the thing standing between me and the work. it never is.
everyone thinks they are one purchase away from being ready. one more gpu, one more course, one more follower count, one more month of learning quietly until they feel good enough to be seen. and they stay there for years. waiting to be ready is the most expensive thing you will ever buy, it just never shows up on a receipt.
i started scrappy and i stayed scrappy on purpose. i posted the small wins, the cheap setups, the ugly first benchmarks, the stuff that felt too small to matter.
and the small stuff is exactly what people needed, because most of them are sitting where i was sitting, staring at a spec sheet they can't afford, convinced they can't begin.
you can begin. today. with whatever you have.
Excited to launch Luce KVFlash. We've been working harder than ever with @davideciffa to bring better DX for local AI.
Today, long context has a second memory bill nobody budgets for: the KV cache.
On Qwen3.6-27B at 256K it costs 4.6 GiB of VRAM and drags decode down to 13 tok/s, because every new token reads the whole thing.
KVFlash keeps a small pool of KV on the GPU, auto-sized to your VRAM, and pages cold 64-token chunks to host RAM, bit-exact and recallable.
decode holds a flat 38.6 tok/s from 64K to the native 256K on a 3090, 2.9x the full cache at 256K, 72 MiB resident and benchmark accuracy unchanged.
Kimi 2.7 ranked 2nd after Fable 5 and before GPT-5 xhigh
We have re-run our ErdosBench smoke test on 14 problems with Kimi 2.7, Qwen 3.7 Max, Grok 4.3 and compared it with the top performers from previous runs.
Kimi 2.7 is amazingly good. More below.
it's not lack of compute that's the issze. it's that in Europe, it's unthinkable to pay a guy in his mid 20s $600k salary and give him resources and freedom to train models without having oversight by a committee of gerontocratic professorswho don't keep up with the research
Intelligence should be open, accessible, and ready to build with, empowering every developer, everywhere.
GLM-5.2 is now available to all GLM Coding Plan users, including Lite, Pro, Max, and Team plans.
https://t.co/AedZACyzej
As our new flagship model, GLM-5.2 delivers powerful coding capabilities, usable 1M-context support, and continued strengths in long-horizon tasks.
API and Chatbot services will launch next week. The model will also be officially open-sourced next week under the MIT License.
The future of AI is open, and it belongs to the people.
@NoetekCo@quantinine@Kimi_Moonshot@basedjensen@TheAhmadOsman Impressive! Thanks for sharing! Whatโs the source for the images you included? I donโt think itโs within my budget capability to acquire 1TB of DDR5, would you say 1TB of DDR4 is way to weak? And then run the RTX6KPros in PCIe 4 instead of PCIe 5?