@croll83 @LefterisJP Agreed, but the right comparison is with DS4 or Kimi, not with Qwen 35B. Also, requiring those top models for 90% seems like a bit of a stretch. With a good plan from those models, in coding at least, I find Qwen to be more than just a toy donkey 😜
@croll83 @LefterisJP I'm running a 4-bit Qwen3.6 35B at 100k context, 100k per seq, max seqs at 4, and mtp2. With this setup, running two or three agents in parallel on different tasks, I'm able to get ~70-90 t/s. Per-agent streaming is smooth; prefill is at 65k, so TTFT doesn't drag at all!
I have serious doubts about running models larger than 30-35B on DGX Spark... Using Nvidia vLLM @ c100k & 8 seq. Able to run only 3 or 4 agents in parallel with ~110 t/s on 4-bit Qwen 35B-A3B-MTP, but it occupies >100GB RAM.
How are people running DS4 on their DGX?? Llama?? 🤷🏻
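A quick back-of-the-envelope for the parallel-agent numbers above. This is only a sketch: it assumes the ~110 t/s is aggregate decode throughput and that it splits evenly across agents, neither of which the post confirms. The helper name `per_agent_rate` is made up for illustration.

```python
def per_agent_rate(aggregate_tps: float, agents: int) -> float:
    """Even split of aggregate decode throughput across concurrent agents.

    Assumption: all agents decode at the same time and share throughput equally.
    """
    if agents < 1:
        raise ValueError("agents must be >= 1")
    return aggregate_tps / agents


# ~110 t/s aggregate with 3 or 4 agents, as quoted in the post
for n in (3, 4):
    print(f"{n} agents: ~{per_agent_rate(110, n):.1f} t/s each")
```

Under those assumptions each agent sees roughly 27-37 t/s, which is still comfortable for streaming.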
Llama.cpp is adding MTP for Qwen models; maybe that is the direction open source is going. I have yet to try the @AtlasInference recipe, as I had concerns over high concurrency. Will give it a try tomorrow to judge the quality on a similar 35B bench.
I've tried both Qwen3.6(s)... @SpaceTimeViking 27B and 35B PrismaQuant recipe from @spark_arena. Default configs.
I must say the local inference has made tremendous progress.
However, DFlash on 27B was bad imho, while MTP on 35B gave much higher and more consistent results. With...l
When buying the DGX Spark, I didn't factor in that running local AI would cost me more on data. Indian ISPs' unlimited plans are all just a scam. Go with @airtelindia; at least they give you 3.3k GB versus @reliancejio's 1k GB per month.
@mr_r0b0t @NVIDIAAI I'm getting a good 60+ average with the PrismaQuant 4-bit variant of the Qwen 3.6 35B A3B recipe available on @spark_arena. Among the various DFlash and MTP setups I have run so far, this one model has given me the most consistent performance. I had
I'm currently testing @NousResearch Hermes locally on the DGX Spark (MSI), and so far it is killing it ⭐💖🥺!
Model: Qwen 3.6 35B
I have high hopes for it!
I'm on a 30 Mbps plan, and @reliancejio is charging me for the remaining 14 days of bandwidth to upgrade the plan to 500 Mbps; even then it will only activate after 3 days. What the heck!
We built a world where people work harder than ever, trust less than ever, own less than ever, and somehow we’re all expected to smile through corporate slogans, political theater, algorithmic addiction, and collapsing attention spans like this is peak civilization.
@sudoingX On many coding/tool-call benchmarks, 3.6 27B is shown to be superior or similar to 120B. Even with a REAP of the 120B, if A11B doesn't give considerable throughput improvements, 27B might be better.
Will do the comparison once I get my GB10 🫢
@Bhavani_00007 M5 Pro. Better chip and thermals. For the LLMs it can fit, you'll get 2-4x prefill and a modest tgen boost. Don't even think about the Air if you have the budget.
@spark_arena @mr_r0b0t @NVIDIAAI True, but that's only peak tg123 with no ctx. Most people require sustained t/s.
Personally, ctx_tg @ d32k or d16k is a sweet spot for agentic tasks, and there decode falls to 50-70 t/s, which is decent.
Imo the image gives more realistic numbers for most real world use cases.
Made this for everyone who is working with a @NVIDIAAI DGX Spark (GB10) ⚡️
Definitely also bookmark the official site, it's a fabulous resource with playbooks for nearly everything you'd want to see!
https://t.co/uAxkSvIbWG
Here's how I went from 23 tok/s to 79 tok/s on my GX10 (DGX Spark) on Qwen3.6-35B-A3B by changing some configs, parameters and firmware upgrades.
I scoured nvidia forums and x so you don't have to...
Got Qwen 3.6 35B-A3B MoE running at ~65 tok/s (c=1) and ~121 tok/s (c=4) aggregate on my Asus GX10 (dgx spark).
Model stack:
• Target: Qwen/Qwen3.6-35B-A3B-FP8
• Drafter: z-lab/Qwen3.6-35B-A3B-DFlash
• Spec decode: DFlash, 10 speculative tokens
• Context: 200k
• KV cache: bf16/auto, not fp8
Used vllm for this (see flags below)
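The actual flags aren't included in this post, so what follows is NOT the author's command. It is a rough sketch of how the model stack listed above could be mapped onto a `vllm serve` invocation. `--max-model-len`, `--kv-cache-dtype` and `--speculative-config` are real vLLM options, but the `"dflash"` method string, the exact 200,000 context value, and the helper name `build_vllm_serve_cmd` are assumptions to check against your vLLM build.

```python
import json
import shlex


def build_vllm_serve_cmd(target, drafter, method, num_spec_tokens,
                         max_model_len, kv_cache_dtype="auto"):
    """Assemble a `vllm serve` argv list (hypothetical helper, nothing is launched)."""
    spec = {
        "method": method,                      # speculative decoding method name
        "model": drafter,                      # draft model used for speculation
        "num_speculative_tokens": num_spec_tokens,
    }
    return [
        "vllm", "serve", target,
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", kv_cache_dtype,    # "auto" keeps the model dtype, i.e. not fp8
        "--speculative-config", json.dumps(spec),
    ]


cmd = build_vllm_serve_cmd(
    target="Qwen/Qwen3.6-35B-A3B-FP8",
    drafter="z-lab/Qwen3.6-35B-A3B-DFlash",
    method="dflash",         # ASSUMPTION: verify your vLLM version accepts this value
    num_spec_tokens=10,
    max_model_len=200_000,   # ASSUMPTION: "200k" read as 200,000 tokens
)
print(shlex.join(cmd))
```

Building the argv as a list and only joining it for display keeps the JSON quoting safe if you later hand it to `subprocess.run`.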
a week with the dgx spark, here is what is on it and what i have measured so far. nobody is really talking about this machine and it is quietly becoming the workhorse of my whole stack.
hardware: nvidia gb10 sm_121, 124 gb unified lpddr5x at 273 gb/s, cuda 13.0
models on disk (305 gb total, 9 ggufs):
> qwen 3.6 27b q4_k_m / q5_k_m / q8_0 / ud-q4_k_xl
> nemotron 3 omni 30b-a3b q4_k_m / q8_0 / ud-q6_k / ud-q6_k_xl
> deepseek v4-flash 158b q4_k_m (112 gb, flagship 128gb-tier test)
terminal + shell environment:
> zsh + oh-my-zsh + powerlevel10k theme
> modern cli stack: bat, eza, ripgrep, fd, git-delta, tldr, neovim, fzf, autojump
> 6 tmux sessions actively running for parallel agent work
ml + agent stack:
> llama.cpp built sm_121 against cuda 13
> uv + venv ml stack with pytorch 2.11.0+cu130 (aarch64) + transformers + diffusers + accelerate
> hermes agent v0.11 with codex auth bridge
> opencode for free-model overnight research
> telegram gateway routing to nemotron q8 right now
speeds verified so far:
> nemotron 30b-a3b q8: 56 tok/s gen, 1,300 tok/s prefill, 96% gpu, 33gb in unified
> qwen 27b dense q4: 40 tok/s consistent
90+ gb of unified memory still free. deepseek v4-flash 158b loading next as the real flagship test, multimodal omni testing once mmproj pulls, comfyui install in flight for the diffusion lane.
honestly curious what the actual limit is on this box, i have not hit it yet.
Everyone's comparing the DGX Spark to a 5090 and calling it slow.
I think that's the wrong comparison.
I ran Qwen3.6 35B-A3B FP8 with the full 262K context window enabled (~96GB RAM) — something gaming GPUs can't really do.
Results:
🟢 No context: 51.3 tok/s, TTFT 110ms
🟣 200K prefill: 34.6 tok/s, TTFT 85s (~2,341 tok/s prefill)
Prefill is way faster than a Mac. And 35 tok/s deep into 200K context, on a model this strong, is genuinely usable.
The Spark plays a different game.
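To sanity-check the 200K numbers above, here's a tiny sketch that treats response time as prefill time plus decode time. It assumes constant prefill and decode rates and ignores any fixed overhead. The 500-token reply length is a made-up example value, and `response_time_s` is just an illustrative helper name.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (ttft, total) in seconds under a simple constant-rate model."""
    ttft = prompt_tokens / prefill_tps     # time to first token ~ prefill time
    decode = output_tokens / decode_tps    # time to stream the reply
    return ttft, ttft + decode


# 200K-token prompt at ~2,341 tok/s prefill, then decoding at 34.6 tok/s
ttft, total = response_time_s(200_000, 500, 2341, 34.6)
print(f"TTFT ~{ttft:.0f}s, total ~{total:.0f}s")
```

With these inputs the model gives a TTFT of about 85 s, matching the quoted figure, and roughly 100 s end to end for a 500-token reply.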