Software Engineer @ Meta · Building with open-source LLMs on local hardware (GB10/Gemma/Llama) · Writing about AI infra & developer tools · Kerala → Seattle
Hi, I’m Vaishak.
Software engineer at Meta. Kerala → Seattle.
I’m about to go deep on running LLMs locally — on real hardware, not cloud APIs.
I’ll be documenting everything: the setup, the failures, the benchmarks, and what actually works.
Follow along if that sounds useful. 🧵
TL;DR: Don't run the dense 31B on DGX Spark. Run the 26B MoE.
Start with Q4_K_M + llama.cpp. Switch to vLLM BF16 when you need multi-user serving. Revisit NVFP4 when NVIDIA closes the software gap.
Full benchmark tables + methodology: https://t.co/f4oPtRpmdt
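A quick sanity check on why both Q4_K_M and BF16 are on the table with 128GB of unified memory. This is a minimal sketch, not a measurement: the ~4.8 bits per weight for Q4_K_M is an assumed ballpark average, and the estimate covers weights only (no KV cache, activations, or runtime overhead). Note that an MoE model still has to hold all of its experts in memory, so the footprint uses total params, not active params.

```python
# Rough weight-only memory estimate. Ignores KV cache, activations,
# and runtime overhead, so real usage will be higher.

UNIFIED_MEMORY_GB = 128.0  # DGX Spark (GB10) unified memory, from the thread

# Total parameter counts in billions, from the thread.
MODELS = {"26B MoE": 26.0, "31B dense": 31.0}

# Bits per weight. BF16 is exactly 16.
# ASSUMPTION: ~4.8 bpw is a ballpark average for Q4_K_M, not a measured value.
FORMATS = {"Q4_K_M (assumed ~4.8 bpw)": 4.8, "BF16": 16.0}


def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def footprint_table():
    """Return (model, format, weight_gb, fits_in_memory) for every combination."""
    rows = []
    for model, params_b in MODELS.items():
        for fmt, bits in FORMATS.items():
            gb = weight_gb(params_b, bits)
            rows.append((model, fmt, gb, gb < UNIFIED_MEMORY_GB))
    return rows


if __name__ == "__main__":
    for model, fmt, gb, fits in footprint_table():
        print(f"{model:10s} {fmt:28s} {gb:6.1f} GB  fits={fits}")
```

Under these assumptions every combination fits, so capacity is not what separates the two models on this box. What is left over for KV cache is a separate question that this sketch does not answer.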
I ran every inference stack I could find on Gemma 4 + DGX Spark (GB10, 128GB unified memory).
The main finding: model architecture accounts for a ~8x throughput difference. Stack choice is secondary.
Here's what actually happened 🧵
https://t.co/f4oPtRpmdt
The NVFP4 story isn't over. Once NVIDIA ships ARM64 + sm_121 PyTorch + Gemma 4 support in a single container, NVFP4 throughput should jump substantially.
The MoE architecture advantage is independent of that: 4B active params beat 31B active params at the same memory bandwidth, full stop.
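The 4B-active vs 31B-active claim can be sanity-checked with a back-of-envelope model. This is a sketch under a simplifying assumption: single-stream decode is memory-bandwidth-bound, so each generated token has to read every active weight once. The 273 GB/s bandwidth figure and the ~4.8 bits per weight are assumptions for illustration, and the function name is hypothetical. The ratio between the two models does not depend on either number.

```python
# Upper bound on single-stream decode speed when memory bandwidth is the limit:
#   tokens/s <= bandwidth / bytes_of_active_weights_read_per_token
# Ignores KV cache reads, routing overhead, and kernel efficiency,
# so measured throughput will be lower than this ceiling.

# ASSUMPTION: 273 GB/s memory bandwidth for GB10; swap in your own figure.
BANDWIDTH_GB_S = 273.0
# ASSUMPTION: ~4.8 bits per weight as a ballpark for a 4-bit K-quant.
BITS_PER_WEIGHT = 4.8


def decode_ceiling_tok_s(active_params_b: float,
                         bits_per_weight: float = BITS_PER_WEIGHT,
                         bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Bandwidth-bound ceiling on tokens/s for a given active parameter count (billions)."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    moe = decode_ceiling_tok_s(4.0)     # 26B MoE, 4B active params per token
    dense = decode_ceiling_tok_s(31.0)  # 31B dense, all params active
    print(f"MoE ceiling:   {moe:.1f} tok/s")
    print(f"Dense ceiling: {dense:.1f} tok/s")
    print(f"Ratio:         {moe / dense:.2f}x")  # 31 / 4 = 7.75
```

The ratio works out to 31 / 4 = 7.75 regardless of bandwidth or quantization, which lines up with the ~8x architecture gap. The absolute ceilings are only as good as the assumed inputs and are not benchmark results.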
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
@NexaExperience Android Auto on Suzuki SmartPlay just doesn't work. This is a well-documented issue on multiple forums, but Suzuki does nothing. I recently bought an S-Cross and now I'm stuck. Horrible experience.