Running Gemma-4-26B-A4B-NVFP4 on my DGX Spark GB10 via vLLM.
Results: ~30 tok/s single-stream, 53 tok/s at c=2.
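Quick sanity check on how well that scales from one stream to two. A minimal sketch using only the two numbers above; the ~30 figure is approximate, so treat the result as rough. The `scaling` helper is just an illustrative name.

```python
def scaling(single_tps, aggregate_tps, concurrency):
    """Return (per-stream tok/s, aggregate speedup, scaling efficiency)."""
    per_stream = aggregate_tps / concurrency   # what each client sees
    speedup = aggregate_tps / single_tps       # total throughput vs. c=1
    efficiency = speedup / concurrency         # 1.0 would be perfect scaling
    return per_stream, speedup, efficiency

# Numbers from the post: ~30 tok/s at c=1, 53 tok/s at c=2
per_stream, speedup, efficiency = scaling(30.0, 53.0, 2)
print(f"{per_stream:.1f} tok/s per stream, {speedup:.2f}x aggregate, {efficiency:.0%} efficiency")
```

So each stream slows down a little, but total throughput nearly doubles.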
For context: Nemotron-3-Super 120B on the same hardware does 14 tok/s.
The MoE architecture is the reason: only 3.8B of the 26B parameters are active per token, so each decode step reads a small fraction of the weights. Memory bandwidth is far less of a bottleneck when you’re barely touching them.
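Back-of-envelope version of that argument, as a rough sketch. Assumptions, not measurements: about 273 GB/s of memory bandwidth for the GB10, about 0.5 bytes per parameter for 4-bit weights (ignoring scale factors and any unquantized layers), and decode limited only by reading weights (no KV cache, no compute, no overhead). `decode_ceiling_tps` is a made-up helper name.

```python
def decode_ceiling_tps(params_read_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on tok/s if every token only costs one pass over the weights read."""
    gb_per_token = params_read_billion * bytes_per_param  # GB read per decoded token
    return bandwidth_gb_s / gb_per_token

BANDWIDTH_GB_S = 273.0   # assumption: GB10 unified memory bandwidth
BYTES_PER_PARAM = 0.5    # assumption: 4-bit weights, scales ignored

moe = decode_ceiling_tps(3.8, BYTES_PER_PARAM, BANDWIDTH_GB_S)     # only active params read
dense = decode_ceiling_tps(26.0, BYTES_PER_PARAM, BANDWIDTH_GB_S)  # if all 26B were read

print(f"MoE ceiling:   {moe:.0f} tok/s")
print(f"Dense ceiling: {dense:.0f} tok/s")
print(f"Ratio:         {moe / dense:.1f}x")
```

Under these assumptions the ceiling is several times higher when only the active experts are read. The measured ~30 tok/s sits well below that ceiling, so other costs still matter in practice.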
18 GB weights. 2-min cold start. Full 128k context, parallel tool calls. Native image, audio, and video support. Should I try FP8 + MTP?
@0xSero I had a shocking Orwellian experience yesterday where Claude got righteous with me. It scared the crap out of me. No doubt it would have contacted my boss if it could’ve. I’m unplugging from Anthropic. Open source will save humanity. No doubt in my mind.
@0xSero I’m assuming you mean you can remotely access a Codex or Claude session running on a local model. It’s really cool. I’ve had issues copying text on mobile, though.