Qwen3.6-27B-MTP at ~61 tok/s. 100k context.
On two *used* RTX 3080 Tis, not the RTX 3090 everyone benchmarks (24GB total, but split across 2 cards on PCIe 3.0 x8/x8, no NVLink).
Running llama.cpp's new MTP speculative decoding. The deep-context bottleneck? Nobody's talking about it. 🧵
@witcheer Won't hot air from the AIO liquid CPU cooler be sucked right into the RTX 5090? Maybe it'll be fine if you don't run sustained 575/600 watt loads.
@JoesInvestments If it's the regular 2 x 3080 (10GB each) you have 20GB of VRAM, which is not quite enough for this quant + MTP. You would probably need to try one of the q3 versions to fit everything in. If you have the 3080 12GB, then this should work exactly the same.
If you have another agent like Claude Code running from the cloud, you could ask it to inspect the logs of what's happening. It could probably find the cause. Do you track the temperature of your 3090s? They might be overheating. For me this usually only happens when I run out of memory (OOM).
It looks like your settings are good. I guess q5_k_xl is slower in generation than q4_k_m. You could try --spec-draft-n-max 3 instead of 2, but I don't think it will make that much difference. I haven't tested 2 x 5060 Ti on the 27B myself. I did test a more compressed Qwen3.6-27B IQ3_K_R4 with no MTP on a single 5060 Ti and got about 28 t/s, which I managed to increase to 33-34 with aggressive memory overclocking: https://t.co/0wGfoTk8LE
@MakJoris @therealazzurro @therealazzurro is getting that with 2 x 5060 Ti. But we don't know his exact setup, perhaps there is room for improvement. Your 37 t/s is extremely good for a 5060 Ti! I got 60 for 2 x 3080 Ti but it does go down to like 40 at longer context.
@therealazzurro You can probably squeeze more tokens out of 2 x 5060 Ti 16GB. Q5_k_xl gives you better quality for sure. Check the comment above yours: 2 x 9060 XT 16GB doing 74 t/s.
@NVenetias Are you running the same quant for the model? I am also using q4 compression for the KV cache here. With 48GB of VRAM you should have more than enough to run the full 262k context.
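
For anyone wondering why q4 on the KV cache matters so much at long context, here is a rough back-of-the-envelope sketch. It assumes plain attention in every layer, which may not match the real model. The layer, head, and dimension numbers in the example are made-up placeholders, not Qwen3.6-27B's actual config. The bits-per-element values are the block sizes of llama.cpp's f16, q8_0, and q4_0 formats.

```python
# Rough KV cache size estimate. Assumes every layer uses standard attention
# with a full K and V cache; hybrid or sliding-window layers would need less.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 values stored in 34 bytes (fp16 scale + 32 int8)
    "q4_0": 4.5,  # 32 values stored in 18 bytes (fp16 scale + 32 nibbles)
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Bytes needed to hold K and V for n_ctx tokens."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * BITS_PER_ELEMENT[cache_type] / 8


if __name__ == "__main__":
    # HYPOTHETICAL architecture numbers, for illustration only.
    example = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for n_ctx in (100_000, 262_144):
        for cache_type in ("f16", "q8_0", "q4_0"):
            gib = kv_cache_bytes(n_ctx, cache_type=cache_type, **example) / 2**30
            print(f"ctx={n_ctx:>7} {cache_type:>5}: {gib:6.2f} GiB")
```

Whatever the real numbers are, q4_0 takes about 28% of the f16 cache size (4.5 / 16), which is the difference between fitting 100k+ context next to the weights or not.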