When the stars align, Qwen3.6 27B (MTP+ngram-mod+ngram-map-kv4) can be quite fast on a DGX Spark (56t/s single concurrency). But yeah, most of the time it's 16t/s. Still nice to hear the toaster go brrrrr!
@shiny_tech @vllm_project https://t.co/AtzjboNM1u
This model runs faster but has tool-calling issues (with OpenCode), so I stopped using it and went back to llama.cpp. I'm going to try a few (bigger) models with NVFP4 and see which one performs well.
@helmutkan @vllm_project https://t.co/AtzjboNM1u
I'm getting 56t/s at 16k context in OpenCode with this model, closer to the advertised 60+ in the card.
Yet I hunger for even more... ;)
@helmutkan @vllm_project No, I'm getting 35t/s with this new version; older versions would crash at startup. I'm expecting more from this version because llama.cpp already gives me better performance with an MXFP4 model (and people claim vLLM is supposed to be faster).
@stevibe In practice I get 35-50t/s on DGX Spark with llama.cpp and OpenCode, since it fills the context up to 16k right at startup. Past 100k context I usually see it making a lot more mistakes, so that's my cut-off point, after which I /compact.
@UnslothAI @Alibaba_Qwen Running Q4_K_XL on DGX Spark right now, getting ~50t/s.
Nice speed (but not unexpected considering the size), and it needs it given how much it generates. Writing entire books while thinking!
Qwen3 Coder Next is the best model to run on a DGX Spark.
Use @CardilloSamuel's Opus Distil MXFP4 quant with llama.cpp. It's only 43GB, which leaves plenty of room for the K/V cache and full context.
Getting 30-45t/s in real-life OpenCode usage. Saved me $400 in the last 2 weeks.
https://t.co/b3oDqOS6v8
I've been running Gemma-4-31B with llama.cpp on DGX Spark using E2B as the draft model. Getting ~18 t/s, compared to the ~11 t/s baseline.
The secret is to set cache-type: q8_0 and spec-type: ngram-mod, and to keep the context at 131072 so it fits in memory and doesn't degrade too much.
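For a rough sense of what that draft setup buys, here is a minimal back-of-the-envelope sketch using only the two throughput figures quoted above. The function names are made up for illustration, and it assumes a steady tokens-per-second rate, which real runs won't hold exactly.

```python
# Throughput figures reported above for Gemma-4-31B on DGX Spark (tokens/second).
baseline_tps = 11.0   # no draft model
draft_tps = 18.0      # with E2B as the draft model

# Context size from the post; 131072 tokens is 128 * 1024.
context_tokens = 131072


def speedup(with_draft: float, baseline: float) -> float:
    """Throughput ratio of the draft setup over the baseline (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return with_draft / baseline


def seconds_for(tokens: int, tps: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady `tps` (an idealization)."""
    return tokens / tps


ratio = speedup(draft_tps, baseline_tps)
print(f"speedup: {ratio:.2f}x")
print(
    f"1000 tokens: {seconds_for(1000, baseline_tps):.0f}s "
    f"-> {seconds_for(1000, draft_tps):.0f}s"
)
```

So a 1000-token answer drops from roughly a minute and a half to under a minute, if the reported rates hold.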