New research published in @Nature with @GoogleDeepMind shows how AMIE utilizes long-context capabilities of Gemini models to ground its clinical reasoning in authoritative medical knowledge.
In a multi-visit study with patient actors, AMIE demonstrated physician-level capabilities in longitudinal management reasoning, while scoring significantly higher in plan preciseness and guideline alignment.
See the full story from @GoogleForHealth below 🧵↓
Vector databases are officially cooked 🤯
This repo shrinks 60 million text chunks from 201 GB to just 6 GB without any loss in accuracy.
→ 97% smaller storage
→ No cloud. No GPU.
→ Runs on a standard laptop
→ Fully private
100% open-source.
WHAT THE HELL is happening in AI?
A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5.
3 BILLION.
The weights are on Hugging Face, anyone can test it.
I genuinely don't know if this is a breakthrough or if the benchmarks are broken.
Google keeps sharing a ton of free content!
Here is one of their most complete examples: a long-running AI agent that pauses, resumes, and never loses context.
It simulates the onboarding of a new employee.
There are the 3 architectural patterns you'll learn from this:
🖥️ Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026)
What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner — no Docker, no Python env, no cloud.
━━━ 8-16GB VRAM ━━━
🔹 Gemma 4-12B (Google)
• Smartest model in this size class — competes with stuff 2× bigger
• Unsloth's MTP GGUFs: 162 tok/s vs 52 tok/s normal (3× speedup)
• Minimum 8GB VRAM recommended for Q4_K_M quant
• GGUF → https://t.co/VWp818MB3D
🔹 LFM2.5-8B-A1B (LiquidAI)
• Hybrid MoE, only 1B active params — absurdly fast for its size
• Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget
• GGUF → https://t.co/ZbOs4mXJDq
━━━ 16-32GB VRAM ━━━
🔹 Qwen3.6-27B (Qwen)
• Scored 1.00 on tool-efficiency benchmarks — best local agent available
• 40 deterministic tasks, 32k/128k context needle tests — all passed
• GGUF → https://t.co/n7K3sPvliE
• MTP version (faster) → https://t.co/gwdfnJTzcy
🔹 Qwopus3.6-27B-v2 (Jackrong)
• Best quantization of Qwen3.6-27B — topped 5 agent & coding benchmarks (1200 samples)
• If you're running Q4, this is the one to grab
• GGUF → https://t.co/tV1DFqXnOD
• MTP version → https://t.co/PMqz7V5ewv
🔹 Gemma 4-31B QAT (Google/Unsloth)
• QAT variant with MTP draft head: 76-125 tok/s (1.67× speedup)
• Excellent for multi-agent / subagent workflows
• GGUF → https://t.co/FgVsUX0YOB
🔹 Nex-N2-Mini (Nex AGI)
• Post-train of Qwen3.5-35B-A3B — MoE with only 3B active params
• Fits on 16GB+ VRAM, overflow loads from system RAM
• Adaptive thinking saves ~20% tokens with no quality loss
• For deep multi-step reasoning, nothing in this size comes close
• GGUF → https://t.co/oyC522a8Eh
━━━ Quick Picks ━━━
• 16GB all-rounder → Gemma 4-12B with MTP GGUFs
• 32GB all-rounder → Qwen3.6-27B / Qwopus-v2
• Agents & tool use → Qwen3.6-27B or Qwopus Q4
• Deep reasoning → Nex-N2-Mini (MoE, fits 16GB+)
• Tight budget → LFM2.5-8B-A1B
• Cheapest full build: 1× used RTX 3090 (24GB) + rest of PC ≈ $1000-1500
━━━ Setup on Windows ━━━
1. Download llama.cpp → https://t.co/et0J7Swua7 (latest .zip)
2. Extract to any folder (e.g. C:\llama.cpp)
3. Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance)
4. Run one of the commands below depending on your hardware
━━━ Launch Commands ━━━
SINGLE GPU — Standard model (no MTP):
llama-server.exe ^
-m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
-ngl 100 ^
-np 1 ^
--port 8080 ^
--jinja
SINGLE GPU — MTP model (faster inference):
llama-server.exe ^
-m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
--spec-type draft-mtp ^
--spec-draft-n-max 3 ^
-ngl 100 ^
-np 1 ^
--port 8080 ^
--jinja
DUAL GPU — Split across two cards:
llama-server.exe ^
-m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
-ngl 100 ^
--tensor-split 0.55,0.45 ^
--main-gpu 0 ^
-np 1 ^
--port 8080 ^
--jinja
DUAL GPU + MTP + Vision (multimodal):
llama-server.exe ^
-m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^
--ctx-size 180000 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--batch-size 1024 --ubatch-size 512 ^
--spec-type draft-mtp ^
--spec-draft-n-max 3 ^
-ngl 100 ^
--tensor-split 0.60,0.40 ^
--main-gpu 0 ^
-np 1 ^
--port 8080 ^
--jinja ^
--mmproj C:\models\mmproj-F16.gguf
━━━ Parameter Breakdown ━━━
-m <path>
Path to your .gguf model file. Change this to wherever you downloaded it.
--ctx-size 180000
Context window in tokens. 180k = huge context for long conversations or big codebases.
Reduce to 32768 or 65536 if you don't need long context — uses less VRAM.
--flash-attn on
Flash Attention — dramatically speeds up inference and reduces VRAM usage.
Works on RTX 30xx/40xx/50xx. Always enable this.
--cache-type-k q4_0 / --cache-type-v q4_0
Quantizes the KV cache (key/value attention cache) to 4-bit.
This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory.
Quality impact is minimal — this is a free performance win.
--batch-size 1024 / --ubatch-size 512
batch-size = how many tokens are processed in one forward pass (throughput).
ubatch-size = micro-batch actually sent to the GPU per step.
Higher = faster prompt processing but needs more VRAM.
If you run out of VRAM, lower these (e.g. 512/256).
-ngl 100
Number of layers to offload to GPU. 100 = all layers on GPU (full offload).
This is what you want if the model fits in your VRAM.
If it doesn't fit, reduce this (e.g. -ngl 40) — remaining layers run on CPU/RAM.
--tensor-split 0.55,0.45
How to split model layers across multiple GPUs. Values are ratios.
0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%.
Adjust based on your VRAM — give more to the card with more memory.
Example: 0.70,0.30 for a 24GB + 12GB setup.
Not needed for single GPU setups.
--main-gpu 0
Which GPU handles the batch computation (the "orchestrator").
Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers.
Minor performance impact — usually just leave it at 0.
-np 1
Number of parallel slots (concurrent requests). 1 = one user at a time.
Increase to 2-4 if you want multiple clients connected simultaneously.
Each extra slot uses additional VRAM for its own KV cache.
--port 8080
Which port the server listens on. Change if port 8080 is busy.
--jinja
Enables Jinja2 template processing — required for proper chat formatting.
Most modern models expect this. Always include it.
--spec-type draft-mtp
Enables Multi-Token Prediction (MTP) speculative decoding.
Only works with MTP GGUF models (downloaded separately).
The model predicts multiple tokens at once and verifies them — big speed boost.
--spec-draft-n-max 3
How many tokens the MTP draft head proposes per step.
3 is a good default. Higher = potentially faster but more VRAM and may reduce quality.
--mmproj <path>
Path to the multimodal projector file (for vision models).
Enables image understanding — paste screenshots into the web chat.
Only needed if you want vision capabilities. Omit for text-only use.
━━━ Your Hardware → Your Command ━━━
Single GPU (8-24GB VRAM):
Use the "Single GPU" command. Change -m to your model path.
8GB card → Gemma 4-12B Q4 or LFM2.5-8B
12GB card → Gemma 4-12B Q5/Q6
16GB card → Gemma 4-31B QAT Q4 or Nex-N2-Mini
24GB card → Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6
Dual GPU:
Use the "Dual GPU" command. Adjust --tensor-split based on your VRAM ratio.
24GB + 24GB → --tensor-split 0.50,0.50
24GB + 12GB → --tensor-split 0.70,0.30
24GB + 8GB → --tensor-split 0.75,0.25
Want speed? Use MTP versions of models with the "MTP" commands.
Want vision? Add --mmproj with the projector file from the model's HuggingFace repo.
5. Once running, you get:
• Web chat UI → http://localhost:8080
• OpenAI-compatible API → http://localhost:8080/v1
• Playground → http://localhost:8080/playground
━━━ Why /v1 API Is the Killer Feature ━━━
One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible — every tool that speaks OpenAI just works. No custom code, no glue layer.
Works out of the box with:
• IDEs: Cursor, Continue, Windsurf, Cline, Roo Code
• CLI tools: aider, Open Interpreter, OpenCode
• Frameworks: LangChain, LlamaIndex, LiteLLM
• Any OpenAI SDK (Python, Node, Go, Rust)
Why this beats cloud APIs:
• 100% private — code never leaves your machine
• $0 per token — no rate limits, no quotas, no surprise bills
• Works fully offline
• Zero telemetry, no training on your data
• Swap models by dropping in a different .gguf — no app changes needed
• Run 32k–128k context windows without burning money
Good combos:
• Cursor + Qwopus-v2 → near-frontier quality, zero API cost
• Continue + Qwen3.6-27B → best local coding agent
• aider + Gemma 4-12B MTP → 162 tok/s, feels instant
• OpenCode + Nex-N2-Mini → deep reasoning on 16GB
Set any OpenAI-compatible client to your local endpoint:
set OPENAI_API_KEY=sk-dummy (any non-empty string works)
set OPENAI_BASE_URL=http://localhost:8080/v1
# every OpenAI-compatible tool now hits your local GPU
Shoutouts: @0xSero@rS_alonewolf@witcheer@UnslothAI@LottoLabs
Excited to launch Luce KVFlash. We've been working harder than ever with @davideciffa to bring better DX for local AI.
Today, long context has a second memory bill nobody budgets for: the KV cache.
On Qwen3.6-27B at 256K it costs 4.6 GiB of VRAM and drags decode down to 13 tok/s, because every new token reads the whole thing.
KVFlash keeps a small pool of KV on the GPU, auto-sized to your VRAM, and pages cold 64-token chunks to host RAM, bit-exact and recallable.
decode holds a flat 38.6 tok/s from 64K to the native 256K on a 3090, 2.9x the full cache at 256K, 72 MiB resident and benchmark accuracy unchanged.
This is really big news. Google introduced the Open Knowledge Format (OKF) - a standardized way to store information in a directory of markdown files. Makes it really easy to make a digital brain that agents can use.
These files can serve as a living wiki. You can give agents the ability to query them or edit them. They can interlink.
Seems to me this could replace Notion or Obsidian. I can think of so many uses for this.
Google's blog post: https://t.co/DqSjg4UpvH
An easier to understand explanation is the SPEC.md file:
https://t.co/A3qSz3Tfas
I gave those two links to Antigravity and asked how we could use it for any of the projects we're working on. It came up with so many ideas. I would imagine Claude Fable 5 would whip up some pretty amazing things based on this system.
Currently creating an OKF library of our pepper garden. It's going to be a fun weekend.
After spending more time down this rabbit hole lately, it seems like @NousResearch Hermes + GBrain (+ Obsidian + GitHub) is the most optimal path
I’ve been setting up an Hermes agent recently and I am wildly impressed by how good it is. With building a strong foundation that is portable + scalable + lightweight, it’s become clear where this direction is heading and I don’t feel pressured by model lock-in
This is 100% the future of agentic workflows
Google releases DiffusionGemma.✨
The new 26B-A4B diffusion text model runs locally on 18GB RAM.
It supports high-speed text generation, thinking, image, video and 256K context.
Run and train via Unsloth Studio.
GGUF: https://t.co/ZH0dCJQ59P
Guide: https://t.co/wYLfJWE6kG
Our Anthropic bill is about to jump from $400K → $1.4M/yr.
Not because usage exploded, but because we're about to cross 150 seats.
Past 150 seats you're forced into Enterprise tier. Seats stop including any usage, every token bills at standard API rates. At our current run rate that's 3.5x overnight.
Unfiltered thoughts on AI spend:
1. We should spend tokens to grow as aggressively as possible. But most people (me included) aren't conscious of what they're spending.
2. Visibility comes first. People see their personal number and they're shocked. I accidentally spent $4,000 in 3 days in Claude Code.
3. For engineering the spend is clearly worth it. Pay for the best model, it saves more than it costs.
4. For a lot of other roles it's questionable. Apps nobody uses, skills someone already built. No ROI.
5. Spend limits are coming. We already require approval for more tokens on our support team.
The era of token-maxxing is coming to an end.
1- AI-generated code just creates more technical debt.
2- At the end of the day, you (the developer) are responsible for the code, not the LLM that generated it. So the less code you have, the easier it is for you to own it.
3- A good engineer knows what code to write, and equally importantly, what code not to write or to delete.
(BTW, this is what we old-school software engineers have been saying for a long time, and we've been called all sorts of names for it.)
> Be me
> Wonder why my new Logitech bluetooth mouse keeps disconnecting from its bluetooth connection
> Use keyboard to go into Settings, turn BT off and then back on and it starts working again
> Think it's something blocking the signal, but it's not
> This happens every five fucking minutes now
> Check battery. Battery is brand new. Replace it anyway
> Every five minutes it dies, I have to go into settings, etc etc
> Finally have enough, look into buying new mouse
> Before purchasing, decide to look up any issues with this BT mouse
> Ask Grok, something that AI is actually very, VERY good for. Turns out Microsoft is entirely to blame
> The fix: you have to go into your Device Manager
> Device Manager is in your Control Panel
> Your Control Panel is a legacy windows interface that Microsoft is desperately trying their best to make you stop using for some fucking reason, they want you to use Settings instead
> But you can't apply this fix without opening your Device Manager
> Go into your Device Manager, open the Bluetooth pulldown, select Intel(R) Wireless Bluetooth(R) and don't get ANY ideas about infringing on Intel's wireless bluetooth trademarks and copyrights because why the fuck else would they inform you of this in the fucking Device Manager
> Right-click, Properties. Select the Power Management tab
> Unclick "Allow this computer to turn off device to save power"
> MS apparently made it so that it turns the device all the way off, they set it to "ON", and they made that the default setting WITHOUT TELLING FUCKING ANYONE
> Begin to understand why everyone fucking hates Windows 11