16 parallel runs of Gemma 4 26B A4B on a single NVIDIA DGX Spark!
Pushing 18 tok/s per instance and a 300 tok/s aggregate. It can even hit 32 parallel runs.
This level of concurrency highlights how efficient the architecture is.
Mistral OCR 4 turned a handwritten calculus exam into clean LaTeX!
We gave it a photo of a hand-written exam page. The model read the handwriting and rebuilt every formula into structured digital text
Output: Time: 5.1s · Cost: $0.09
Formulas came through exactly right - the hard part was nailed. The graph, unfortunately, it didn’t redraw. But that’s the telling part: most OCR tools just dump the text and quietly drop the figure. OCR 4 caught the plot, boxed it, and tagged it as a chart. It doesn’t get redrawn, but it gets read and accounted for
Local AI hardware = capacity × bandwidth × software stack
- Capacity tells you what fits
- Bandwidth tells you how hard the box can breathe
- The software stack tells you how much of the spec sheet you can actually cash out.
Hardware by Memory Bandwidth
- Mac Studio M3 Ultra: up to 512GB @ 819 GB/s
- RTX PRO 6000 Blackwell: 96GB @ 1792 GB/s
- RTX 5090: 32GB @ 1792 GB/s
- RTX 4090: 24GB @ 1008 GB/s
- RX 7900 XTX: 24GB @ 960 GB/s
- Radeon PRO W7900: 48GB @ 864 GB/s
- AMD Radeon AI PRO R9700: 32GB @ 640 GB/s
- Intel Arc Pro B65: 32GB @ ~608 GB/s
- Tenstorrent Wormhole n300: 24GB @ 576 GB/s
- Tenstorrent Blackhole p150: 32GB @ 512 GB/s + 800G
- MacBook Pro M5 Max: 460-614 GB/s
- MacBook Pro M5 Pro: 307 GB/s
- DGX Spark: 128GB @ 273 GB/s (coherent + CUDA)
- Mac mini M4 Pro: 273 GB/s
- Ryzen AI Max / Strix Halo: ~256 GB/s (~96GB usable GPU)
- MacBook Air M5: 153 GB/s
- Snapdragon X2 Elite: 152-228 GB/s
- Intel Lunar Lake: 136 GB/s
- Snapdragon X Elite: 135 GB/s
- Mac mini M4: 120 GB/s
- Arc Pro B60: 24GB @ ~456 GB/s
Verdict
- GPUs are still the bandwidth kings
- Apple wins: stupid amounts of memory, don’t want to shard across GPUs
- Apple loses: when raw tokens/sec & concurrency matter more
- DGX Spark: coherent memory + NVIDIA stack
- Strix Halo / Ryzen AI Max: first real x86 unified-memory contender
- Tenstorrent: fully OSS stack, excited to see this mature
Fitting ≠ serving
Even if it fits, you still pay for
- bandwidth during decode
- KV cache growth
- dequantization
- batching + concurrency
- scheduler quality
- framework overhead
The only mental model that matters:
1. What must fit?
2. What bandwidth tier do I need?
3. What software stack can actually deliver it?
In short:
- NVIDIA → fastest raw speed
- Apple Studio M3 Ultra → biggest one-box memory
- Strix Halo → first real x86 unified
- DGX Spark → coherent NVIDIA dev appliance
- AMD / Intel Arc → rising alternatives
- Tenstorrent → fully opensource stack
Do ask: “which bottleneck am I buying?”
Not: “which hardware is best?”
Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API.
Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls.
Try it: https://t.co/hhO6qTawgb 🐡
I love this! Santander has open-sourced its open-source AI initiatives.
The bank pushed 11 repos, live this week under Apache-2.0 on the code, but the data synthetic or anonymised only.
Quite a moment for a bank this size, putting its AI control layer on the open internet for anyone to fork. This is the bit every bank has to get right.
So what is it?
→ autoguardrails: a scaffold for stress-testing LLM guardrails, jailbreaks included (can we use this LLM?)
→ "mechanical governance" for high-stakes LLM decisions, with hard gates and governance metrics (can we trust an LLM with this decision?)
→ mutatis-mutandis: discrimination testing with counterfactual comparators, straight out of a published paper (very important if you're lending!)
→ stressed-datasets: public benchmarks republished in "stressed" form to probe model robustness in that scenario
→ gen-fraud-graph: a synthetic fraud-graph generator to benchmark fraud detection (really, really cool, need to dig into this one)
→ llm_bridge: a vendor-neutral client for OpenAI, Bedrock and Gemini, so you skip the lock-in (again, how many companies are struggling with this?)
→ ralph: their own spin on the Ralph loop, the run-an-agent-in-a-loop trick from the indie AI crowd
I think I need to write a whole Rant on each of these pieces.
The most important thing for a big regulated actor is "Can you show a decision was safe, fair, auditable, and the same tomorrow as it was today." Santander published its working answer and handed it to everyone, competitors included.
Why give it away?
1. Attract talent - this is a huge signal they've got their AI act together
2. Signal internally - We have these tools, use them
3. Give regulators confidence - Here's how we work, you can audit it
(The board that signs off on releases includes Legal and the CISO. That tells you how seriously they treat it.)
I've watched banks spend years trying to govern AI behind closed doors and ship nothing. Doing it in the open, with a contributor agreement and a proper open-source office, is a faster route to getting it right.
The banks that pull ahead from here will be the ones who can prove their AI works.
@bancosantander just open-sourced a head start.
Repo is here. 👇
https://t.co/IilShwzvl2
NVIDIA DROPPED SOMETHING BIG FOR AI AGENTS
Nvidia open-sourced a catalog of 110+ verified "agent skills" portable instruction sets that teach ai agents how to use cuda-x libraries and platform tools correctly
→ covers cuopt, nemo, dynamo, rag, deepstream, medical ai, physical ai, and more
→ every skill is signed with an oms signature verifiable against nvidia's trust anchor
→ works with claude code, codex, cursor, and kiro out of the box
→ install any skill in one line: `npx skills add nvidia/skills`
this is capability governance for ai agents not just tools, but verified, auditable instructions that agents can actually trust
https://t.co/WUF6OG9mSN
A huge advantage of model agnostic platforms like Devin is immediate access to the latest models as soon as they ship without having to lift a finger:
GLM 5.2 (still free!)
Kimi K2.7 (still free!)
SWE-1.6 (still free!)
Claude suite including Opus
OpenAI suite including Codex
Gemini models
Deepseek
xAI / Grok
Minimax 2.1
Qwen3
Adaptive which automatically balances intelligence + cost
All in a single subscription and across multiple surface areas including CLI, desktop app, and cloud agents / mobile.
This will only become more advantageous as model diversity continues to expand and open source models continue to improve.
8 RAG architectures for AI Engineers:
(explained with usage)
1) Naive RAG
- Retrieves documents purely based on vector similarity between the query embedding and stored embeddings.
- Works best for simple, fact-based queries where direct semantic matching suffices.
2) Multimodal RAG
- Handles multiple data types (text, images, audio, etc.) by embedding and retrieving across modalities.
- Ideal for cross-modal retrieval tasks like answering a text query with both text and image context.
3) HyDE (Hypothetical Document Embeddings)
- Queries are not semantically similar to documents.
- This technique generates a hypothetical answer document from the query before retrieval.
- Uses this generated document’s embedding to find more relevant real documents.
4) Corrective RAG
- Validates retrieved results by comparing them against trusted sources (e.g., web search).
- Ensures up-to-date and accurate information, filtering or correcting retrieved content before passing to the LLM.
5) Graph RAG
- Converts retrieved content into a knowledge graph to capture relationships and entities.
- Enhances reasoning by providing structured context alongside raw text to the LLM.
6) Hybrid RAG
- Combines dense vector retrieval with graph-based retrieval in a single pipeline.
- Useful when the task requires both unstructured text and structured relational data for richer answers.
7) Adaptive RAG
- Dynamically decides if a query requires a simple direct retrieval or a multi-step reasoning chain.
- Breaks complex queries into smaller sub-queries for better coverage and accuracy.
8) Agentic RAG
- Uses AI agents with planning, reasoning (ReAct, CoT), and memory to orchestrate retrieval from multiple sources.
- Best suited for complex workflows that require tool use, external APIs, or combining multiple RAG techniques.
Most architectures here involve some form of retrieval-time decision. But they all run on top of whatever was already indexed.
If that indexing step outputs messy chunks, every architecture inherits them. Improving it is a separate problem from the 8 above.
My co-founder wrote about a better unit for the indexing step. The technique:
- cuts corpus size by 40x.
- reduces tokens per query by 3x.
- improves vector search relevance by 2.3x.
And it doesn't alter the retrieval algorithm, the reranker, or the embedding model.
Read it below.
Web scraping will never be the same.
(100% open-source visual search at scale)
PixelRAG is a retrieval system that skips HTML parsing completely.
Instead of scraping a page into text and embedding chunks, it screenshots the page and retrieves the image. A vision-language model reads the answer straight off the pixels.
Why that matters: parsing is where web RAG quietly loses information.
- A single HTML-to-text parser can drop 40%+ of a page.
- Tables, charts, and layout get flattened or thrown out.
- Swapping parsers alone can move accuracy ~10 points on the same docs.
PixelRAG indexes the page a person actually sees. The team built a visual index of all of Wikipedia, 30M+ screenshots, and it still beats the strongest text RAG baseline by 18.1% on text-only QA.
The repo also ships a Claude Code plugin that gives Claude eyes.
It lets Claude screenshot any URL and read the rendered page instead of scraping the DOM. So you can hand it a live page, an arXiv paper, or your local site and ask what it actually looks like.
One setup script. No MCP server, no backend.
How the pipeline works:
- Renders each document (web, PDF, image) to image tiles.
- Embeds them with Qwen3-VL-Embedding, LoRA fine-tuned on screenshots.
- Builds a FAISS index and serves a search API.
A stronger reader model lifts accuracy with no re-indexing, since the index is just pixels.
Everything is open-source under Apache-2.0.
GitHub repo: https://t.co/qun9TjAdmw
Talking about RAG, I recently wrote an article on a new approach that makes retrieval much more efficient by cutting corpus size by 40x, reducing tokens per query by 3x, and improving vector search relevance by 2.3x.
The article is quoted below.
🚨 NVIDIA just open-sourced an official skillset for AI agents.
> Every skill NVIDIA-verified and cryptographically signed
> cuOpt, NeMo, CUDA-Q, Dynamo, Medical AI, RAG, and more
> Works with Claude Code, Codex, Cursor, and Kiro
100% Free.
1/
🎁 GLM-5.2 is free to use through Hugging Face Inference Providers for a limited time.
It is one of the hottest open-weight models right now.
I tested it with opencode and it works.
Full config below 👇
🤖 Bring your own AI models to @code!
Connect models from providers you already use, run local models, and choose the right model for every workflow in VS Code.
📖 Read the full post: https://t.co/od5Hb9SX0v
Qwen3.6 35B A3B MTP running locally on my RTX 4070 12GB + 5800x + 64gb Ram.
Peaked at ~60 TPS decode.
That still feels crazy to me.
I’m a developer, but I had basically zero experience with local LLMs when I started. After AI pricing started exploding everywhere, I got curious and began researching what was actually possible to run locally.
That frustration honestly became motivation: I wanted something local, usable, and under my control.
A few months ago I would’ve assumed this kind of model was completely out of reach for consumer GPUs. It’s not perfect, and there are tradeoffs, but it’s absolutely usable.
I honestly think this is the future.
While models like Fable and Mythos are locked away, we still have a real path forward: good open/local models running on hardware people can actually own.
Sharing my command in case it helps someone else with a 12GB card:
llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL --host 0.0.0.0 --port ${PORT} --ctx-size 131072 --predict 32768 --batch-size 4096 --ubatch-size 1024 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --mlock --no-mmap --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 --parallel 1 --metrics --jinja --reasoning on --reasoning-format auto --reasoning-budget 2048 -fitt 1792 -ctkd q8_0 -ctvd q8_0 -ctxcp 32 --no-warmup --spec-type draft-mtp --spec-draft-n-max 2 -ncmoe 29 --chat-template-kwargs "{ \"preserve_thinking\":true}" --swa-full --checkpoint-min-step 512 --no-context-shift -cram 8192 --reasoning-budget-message "Okay, I have thought enough. I will now provide the final answer" --cache-prompt --cache-reuse 256
The future is bright.
I also want to say thanks to the community.
Even though most of you don’t know me, your shared knowledge helped me a lot while I was learning this stuff from scratch.
Special thanks to @outsource_@Tono_Ken3@loktar00@noctus91@witcheer@LottoLabs and many others.
Thanks for sharing what you know.
Best models for your hardware
- 4gb to 12gb vram -
VibeThinker-3B - smokes everything remotely close to its weight class. Challenging 30b models! Last version was also topping math benchmarks
https://t.co/RTchJFFTnV
- 12gb to 24gb vram -
Gemma-12B-coder
Built on top of an already strong model, reduced refusals and 262k context window trained on fable traces https://t.co/DVAhlQ7Y4n
- 24gb to 64gb vram -
Gemma-4-26b-diffusion
This model was already by far one of the most functional and capable models, now it’s hitting 500+ tok/s on consumer hardware! Smart AF made by Google deepmind https://t.co/mSaWPFpgXQ
Cohere North-Mini-Code 30B
A new coding model made by an already impressive lab, its priming worth a shot if you’re looking to test the limits of local coding https://t.co/gDPEj6lPAW
———
For those with 4x 6000s or 3x DGX Spark I think my GLM-5.2-REAP is worth a shot.
Lmk how it goes!
I DELETED CHATGPT 3 MONTHS AGO. HERE'S WHAT I RUN INSTEAD
paying $20/mo to re-explain yourself every single session is insane.
my replacement:
> Obsidian - it remembers, because it's MY files
> MiniMax M3 - reads my entire vault, not 3 paragraphs
> Hermes - an agent that learns me and never resets
now every note I write makes the next answer smarter. it compounds. ChatGPT just forgets you by morning.
the full setup, no API gymnastics 👇
my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it.
ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM
last week I ran Gemma 4 26B A4B a mixture of experts model on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy.
but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies.
so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine?
# Hardware:
GPU: NVIDIA RTX 4060, 8 GB VRAM
RAM: 16 GB
CPU: Intel Core i7 H
Laptop. Gaming. Modest.
The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
(model's unsloth huggingface link in the comments)
This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth. This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded.
# the flags I used:
-m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v
Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup.
# Results:
→ Decode: ~3 tokens/sec
→ Prefill: ~2 tokens/sec
→ Context: 6000 tokens
→ Hardware crying quietly in the corner: yes
so is 3 tps actually usable?
For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps.
but slow ≠ useless. And this is where it gets genuinely interesting.
think about how senior devs actually work in a real team. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior.
That's exactly the local AI agent architecture this unlocks:
→ Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev.
→ Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions. The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on a budget hardware. (requires 2 8gb gpus)
other workflows where 3 tps is completely fine:
- overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results.
- One shot deep reasoning
- Silent code audit loops, you write and test, the 31B reviews diffs and flags issues in the background between your sprints
- Any workflow where output quality > output speed
A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with a NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting.
Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating. The hardware bar to run serious frontier class models locally keeps dropping.
the tools are here. the models are here. you just have to be willing to abuse your laptop a little.
what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.
We recently released Gemma 4, our most capable open models to date. Since then, they’ve been downloaded more than 150 million times. Here’s how three builders are using @GoogleGemma to create apps, platforms, and more.
First up: Builder @measure_plan, who used Gemma 4 to perform visual question answering (VQA) through a specific persona.
🔹By prompting Gemma 4 this way, the model effectively maintained a "medieval bard" character while accurately identifying objects in the room, like “glass of amber liquid” or “shelves with bound tomes.”
The weights are public.
Try it yourself.
Give it the hardest coding or math problem you can think of and reply with the results.
I wanna know if this thing is real.
Paper: https://t.co/H35zarCtxv
GitHub: https://t.co/lyCOerHcoV
Hugging Face: https://t.co/G7iUU6jTXe