Dawid Rutkowski

🖥️ Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner — no Docker, no Python env, no cloud. ━━━ 8-16GB VRAM ━━━ 🔹 Gemma 4-12B (Google) • Smartest model in this size class — competes with stuff 2× bigger • Unsloth's MTP GGUFs: 162 tok/s vs 52 tok/s normal (3× speedup) • Minimum 8GB VRAM recommended for Q4_K_M quant • GGUF → https://t.co/VWp818MB3D 🔹 LFM2.5-8B-A1B (LiquidAI) • Hybrid MoE, only 1B active params — absurdly fast for its size • Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget • GGUF → https://t.co/ZbOs4mXJDq ━━━ 16-32GB VRAM ━━━ 🔹 Qwen3.6-27B (Qwen) • Scored 1.00 on tool-efficiency benchmarks — best local agent available • 40 deterministic tasks, 32k/128k context needle tests — all passed • GGUF → https://t.co/n7K3sPvliE • MTP version (faster) → https://t.co/gwdfnJTzcy 🔹 Qwopus3.6-27B-v2 (Jackrong) • Best quantization of Qwen3.6-27B — topped 5 agent & coding benchmarks (1200 samples) • If you're running Q4, this is the one to grab • GGUF → https://t.co/tV1DFqXnOD • MTP version → https://t.co/PMqz7V5ewv 🔹 Gemma 4-31B QAT (Google/Unsloth) • QAT variant with MTP draft head: 76-125 tok/s (1.67× speedup) • Excellent for multi-agent / subagent workflows • GGUF → https://t.co/FgVsUX0YOB 🔹 Nex-N2-Mini (Nex AGI) • Post-train of Qwen3.5-35B-A3B — MoE with only 3B active params • Fits on 16GB+ VRAM, overflow loads from system RAM • Adaptive thinking saves ~20% tokens with no quality loss • For deep multi-step reasoning, nothing in this size comes close • GGUF → https://t.co/oyC522a8Eh ━━━ Quick Picks ━━━ • 16GB all-rounder → Gemma 4-12B with MTP GGUFs • 32GB all-rounder → Qwen3.6-27B / Qwopus-v2 • Agents & tool use → Qwen3.6-27B or Qwopus Q4 • Deep reasoning → Nex-N2-Mini (MoE, fits 16GB+) • Tight budget → LFM2.5-8B-A1B • Cheapest full build: 1× used RTX 3090 (24GB) + rest of PC ≈ $1000-1500 ━━━ Setup on Windows ━━━ 1. Download llama.cpp → https://t.co/et0J7Swua7 (latest .zip) 2. Extract to any folder (e.g. C:\llama.cpp) 3. Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance) 4. Run one of the commands below depending on your hardware ━━━ Launch Commands ━━━ SINGLE GPU — Standard model (no MTP): llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja SINGLE GPU — MTP model (faster inference): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU — Split across two cards: llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ --tensor-split 0.55,0.45 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU + MTP + Vision (multimodal): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ --tensor-split 0.60,0.40 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja ^ --mmproj C:\models\mmproj-F16.gguf ━━━ Parameter Breakdown ━━━ -m <path> Path to your .gguf model file. Change this to wherever you downloaded it. --ctx-size 180000 Context window in tokens. 180k = huge context for long conversations or big codebases. Reduce to 32768 or 65536 if you don't need long context — uses less VRAM. --flash-attn on Flash Attention — dramatically speeds up inference and reduces VRAM usage. Works on RTX 30xx/40xx/50xx. Always enable this. --cache-type-k q4_0 / --cache-type-v q4_0 Quantizes the KV cache (key/value attention cache) to 4-bit. This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory. Quality impact is minimal — this is a free performance win. --batch-size 1024 / --ubatch-size 512 batch-size = how many tokens are processed in one forward pass (throughput). ubatch-size = micro-batch actually sent to the GPU per step. Higher = faster prompt processing but needs more VRAM. If you run out of VRAM, lower these (e.g. 512/256). -ngl 100 Number of layers to offload to GPU. 100 = all layers on GPU (full offload). This is what you want if the model fits in your VRAM. If it doesn't fit, reduce this (e.g. -ngl 40) — remaining layers run on CPU/RAM. --tensor-split 0.55,0.45 How to split model layers across multiple GPUs. Values are ratios. 0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%. Adjust based on your VRAM — give more to the card with more memory. Example: 0.70,0.30 for a 24GB + 12GB setup. Not needed for single GPU setups. --main-gpu 0 Which GPU handles the batch computation (the "orchestrator"). Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers. Minor performance impact — usually just leave it at 0. -np 1 Number of parallel slots (concurrent requests). 1 = one user at a time. Increase to 2-4 if you want multiple clients connected simultaneously. Each extra slot uses additional VRAM for its own KV cache. --port 8080 Which port the server listens on. Change if port 8080 is busy. --jinja Enables Jinja2 template processing — required for proper chat formatting. Most modern models expect this. Always include it. --spec-type draft-mtp Enables Multi-Token Prediction (MTP) speculative decoding. Only works with MTP GGUF models (downloaded separately). The model predicts multiple tokens at once and verifies them — big speed boost. --spec-draft-n-max 3 How many tokens the MTP draft head proposes per step. 3 is a good default. Higher = potentially faster but more VRAM and may reduce quality. --mmproj <path> Path to the multimodal projector file (for vision models). Enables image understanding — paste screenshots into the web chat. Only needed if you want vision capabilities. Omit for text-only use. ━━━ Your Hardware → Your Command ━━━ Single GPU (8-24GB VRAM): Use the "Single GPU" command. Change -m to your model path. 8GB card → Gemma 4-12B Q4 or LFM2.5-8B 12GB card → Gemma 4-12B Q5/Q6 16GB card → Gemma 4-31B QAT Q4 or Nex-N2-Mini 24GB card → Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6 Dual GPU: Use the "Dual GPU" command. Adjust --tensor-split based on your VRAM ratio. 24GB + 24GB → --tensor-split 0.50,0.50 24GB + 12GB → --tensor-split 0.70,0.30 24GB + 8GB → --tensor-split 0.75,0.25 Want speed? Use MTP versions of models with the "MTP" commands. Want vision? Add --mmproj with the projector file from the model's HuggingFace repo. 5. Once running, you get: • Web chat UI → http://localhost:8080 • OpenAI-compatible API → http://localhost:8080/v1 • Playground → http://localhost:8080/playground ━━━ Why /v1 API Is the Killer Feature ━━━ One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible — every tool that speaks OpenAI just works. No custom code, no glue layer. Works out of the box with: • IDEs: Cursor, Continue, Windsurf, Cline, Roo Code • CLI tools: aider, Open Interpreter, OpenCode • Frameworks: LangChain, LlamaIndex, LiteLLM • Any OpenAI SDK (Python, Node, Go, Rust) Why this beats cloud APIs: • 100% private — code never leaves your machine • $0 per token — no rate limits, no quotas, no surprise bills • Works fully offline • Zero telemetry, no training on your data • Swap models by dropping in a different .gguf — no app changes needed • Run 32k–128k context windows without burning money Good combos: • Cursor + Qwopus-v2 → near-frontier quality, zero API cost • Continue + Qwen3.6-27B → best local coding agent • aider + Gemma 4-12B MTP → 162 tok/s, feels instant • OpenCode + Nex-N2-Mini → deep reasoning on 16GB Set any OpenAI-compatible client to your local endpoint: set OPENAI_API_KEY=sk-dummy (any non-empty string works) set OPENAI_BASE_URL=http://localhost:8080/v1 # every OpenAI-compatible tool now hits your local GPU Shoutouts: @0xSero @rS_alonewolf @witcheer @UnslothAI @LottoLabs

886

103K

calcarinus retweeted

Daily Dose of Data Science

@DailyDoseOfDS_

1 day ago

Claude Code fully dissected! Researchers from UCL reverse-engineered the leaked Claude source. What they found changes how you should think about agent design. Only 1.6% of the codebase is AI decision logic. The other 98.4% is operational infrastructure. Permission gates, tool routing, context compaction, recovery logic, session persistence. The model reasons. The harness does everything else. This is the opposite of what most agent frameworks do today. LangGraph routes model outputs through explicit state machines. Devin bolts heavy planners onto operational scaffolding. Claude Code gives the model maximum decision latitude inside a rich deterministic harness, and invests all its engineering effort in that harness. The core loop is a simple while-true. Call model, run tools, repeat. But the systems around that loop are where the real design lives: A permission system with 7 modes and an ML classifier. Users approve 93% of prompts anyway, so the architecture compensates with automated layers instead of adding more warnings. A 5-layer context compaction pipeline. Each layer runs only when cheaper ones fail. Budget reduction, snip, microcompact, context collapse, auto-compact. Four extension mechanisms ordered by context cost. Hooks (zero), skills (low), plugins (medium), MCP (high). Each answers a different integration problem. Subagents return only summary text to the parent. Their full transcripts live in sidechain files. Agent teams still cost roughly 7x the tokens of a standard session. Resume does not restore session-scoped permissions. Trust is re-established every session. That friction is the point. The bet behind all of this is simple. As frontier models converge on raw coding ability, the quality of the harness becomes the differentiator, not the model. Paper: Dive into Claude Code (arXiv:2604.14228) We've shared an article on Agent Harness and what every big company is building. Read it below.

DailyDoseOfDS_'s tweet photo. Claude Code fully dissected!

Researchers from UCL reverse-engineered the leaked Claude source. What they found changes how you should think about agent design.

Only 1.6% of the codebase is AI decision logic.

The other 98.4% is operational infrastructure. Permission gates, tool routing, context compaction, recovery logic, session persistence. The model reasons. The harness does everything else.

This is the opposite of what most agent frameworks do today.

LangGraph routes model outputs through explicit state machines. Devin bolts heavy planners onto operational scaffolding. Claude Code gives the model maximum decision latitude inside a rich deterministic harness, and invests all its engineering effort in that harness.

The core loop is a simple while-true. Call model, run tools, repeat.

But the systems around that loop are where the real design lives:

A permission system with 7 modes and an ML classifier. Users approve 93% of prompts anyway, so the architecture compensates with automated layers instead of adding more warnings.

A 5-layer context compaction pipeline. Each layer runs only when cheaper ones fail. Budget reduction, snip, microcompact, context collapse, auto-compact.

Four extension mechanisms ordered by context cost. Hooks (zero), skills (low), plugins (medium), MCP (high). Each answers a different integration problem.

Subagents return only summary text to the parent. Their full transcripts live in sidechain files. Agent teams still cost roughly 7x the tokens of a standard session.

Resume does not restore session-scoped permissions. Trust is re-established every session. That friction is the point.

The bet behind all of this is simple. As frontier models converge on raw coding ability, the quality of the harness becomes the differentiator, not the model.

Paper: Dive into Claude Code (arXiv:2604.14228)

We've shared an article on Agent Harness and what every big company is building.

Read it below.

273

202K

Who to follow

Hanna Seppänen

@HannaSeppnen1

Upper gi-surgeon, Adj Prof, founder of Pancreative-research group in Helsinki University Hospital, Finland @HUS_fi @helsinkiuni @Pancreative_fi

Christoph Ammer-Herrmenau

@AmmerHerrmenau

Pancreatology, Pancreas2000, Microbiome, translational research, clinical trials, Endoscopy, @yourUMG, Department of Gastroenterology, Climbing🧗

Asif Halimi

@AsifHalimi

MD, PhD. FACS Chief of Upper GI Surgery, Consultant Surgeon, Umeå University hospital

Dawid Rutkowski

@calcarinus

about 6 hours ago

@sakurayukiai @l33tguy Thanks! Will check this out!

Dawid Rutkowski

@calcarinus

about 6 hours ago

@l33tguy Beautiful, do you use a PSU splitter for your multiple PSUs? Or do you just connect the additional PSUs straight to the extra GPUs?

191

calcarinus retweeted

vivek

@itsreallyvivek

5 days ago

https://t.co/ZhOgyq7Vgn

136

22K

calcarinus retweeted

Sudo su

@sudoingX

about 10 hours ago

people keep asking me, if you were starting over today with nothing, how would you do it. here is the honest answer, the one nobody wants because it isn't sexy. i would buy the cheapest used card that runs a real model. a 3060 12gb, 200 bucks, and i would not spend another cent until i had squeezed everything out of it. not because i couldn't dream bigger, but because the card was never the thing standing between me and the work. it never is. everyone thinks they are one purchase away from being ready. one more gpu, one more course, one more follower count, one more month of learning quietly until they feel good enough to be seen. and they stay there for years. waiting to be ready is the most expensive thing you will ever buy, it just never shows up on a receipt. i started scrappy and i stayed scrappy on purpose. i posted the small wins, the cheap setups, the ugly first benchmarks, the stuff that felt too small to matter. and the small stuff is exactly what people needed, because most of them are sitting where i was sitting, staring at a spec sheet they can't afford, convinced they can't begin. you can begin. today. with whatever you have.

158

Dawid Rutkowski

@calcarinus

about 9 hours ago

@pupposandro @davideciffa Impressive!

140

calcarinus retweeted

Sandro

@pupposandro

about 9 hours ago

Excited to launch Luce KVFlash. We've been working harder than ever with @davideciffa to bring better DX for local AI. Today, long context has a second memory bill nobody budgets for: the KV cache. On Qwen3.6-27B at 256K it costs 4.6 GiB of VRAM and drags decode down to 13 tok/s, because every new token reads the whole thing. KVFlash keeps a small pool of KV on the GPU, auto-sized to your VRAM, and pages cold 64-token chunks to host RAM, bit-exact and recallable. decode holds a flat 38.6 tok/s from 64K to the native 256K on a 3090, 2.9x the full cache at 256K, 72 MiB resident and benchmark accuracy unchanged.

143

137

19K

calcarinus retweeted

Andrew Carr 🤸

@andrew_n_carr

1 day ago

Fable feels like a polish freelancer. Shows up, speaks in its own dialect, writes the best code you've ever seen, and then disappears

137

336

162K

calcarinus retweeted

Ahmad

@TheAhmadOsman

about 17 hours ago

Kimi-K2.7-Code is the new Opensouece SoTA for Coding & Agentic workflows

855

192

45K

calcarinus retweeted

Przemek Chojecki | PC

@prz_chojecki

1 day ago

Kimi 2.7 ranked 2nd after Fable 5 and before GPT-5 xhigh We have re-run our ErdosBench smoke test on 14 problems with Kimi 2.7, Qwen 3.7 Max, Grok 4.3 and compared it with the top performers from previous runs. Kimi 2.7 is amazingly good. More below.

prz_chojecki's tweet photo. Kimi 2.7 ranked 2nd after Fable 5 and before GPT-5 xhigh

We have re-run our ErdosBench smoke test on 14 problems with Kimi 2.7, Qwen 3.7 Max, Grok 4.3 and compared it with the top performers from previous runs.

Kimi 2.7 is amazingly good. More below. https://t.co/pD1EFRJbAy

140

380

Dawid Rutkowski

@calcarinus

about 19 hours ago

@0xSero I keep hearing that M3 is not as strong as K2.7, people claim the former is benchmaxxed. What is your opinion on this?

calcarinus retweeted

hallerite

@hallerite

1 day ago

it's not lack of compute that's the issze. it's that in Europe, it's unthinkable to pay a guy in his mid 20s $600k salary and give him resources and freedom to train models without having oversight by a committee of gerontocratic professorswho don't keep up with the research

121

439

812

467K

Dawid Rutkowski

@calcarinus

1 day ago

@0xSero Fully agree!

calcarinus retweeted

Z.ai @Zai_org

2 days ago

Intelligence should be open, accessible, and ready to build with, empowering every developer, everywhere. GLM-5.2 is now available to all GLM Coding Plan users, including Lite, Pro, Max, and Team plans. https://t.co/AedZACyzej As our new flagship model, GLM-5.2 delivers powerful coding capabilities, usable 1M-context support, and continued strengths in long-horizon tasks. API and Chatbot services will launch next week. The model will also be officially open-sourced next week under the MIT License. The future of AI is open, and it belongs to the people.

323

935

Dawid Rutkowski

@calcarinus

2 days ago

@NoetekCo @quantinine @Kimi_Moonshot @basedjensen @TheAhmadOsman Would you favour qwen 27b fp16 over deepseek v4 flash? (Assuming you have two R6KPro)?

Dawid Rutkowski

@calcarinus

2 days ago

@sakurayukiai @outsource_ Great source, thanks!

Dawid Rutkowski

@calcarinus

2 days ago

@NoetekCo @quantinine @Kimi_Moonshot @basedjensen @TheAhmadOsman Impressive! Thanks for sharing! What’s the source for the images you included? I don’t think it’s within my budget capability to acquire 1TB of DDR5, would you say 1TB of DDR4 is way to weak? And then run the RTX6KPros in PCIe 4 instead of PCIe 5?

Dawid Rutkowski

@calcarinus

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users