AboveSpec

Verified account

@above_spec

Love 3D printing, playing with local LLMs and learning Claude Code

Ontario, Canada

Joined December 2017

178 Following

1.5K Followers

323 Posts

Pinned Tweet

18 days ago

Qwen3.6-27B-MTP at ~61 tok/s. 100k context. On two *used* RTX 3080 Tis (the same 24GB as a 3090, but split across 2 cards on PCIe 3.0 x8/x8, no NVLink), not the RTX 3090 everyone benchmarks. Running llama.cpp's new MTP speculative decoding. The deep-context bottleneck? Nobody's talking about it. 🧵

19

214

14

198

18K

6 days ago

@afcmax22 Less EMFs for wired headphones = healthier for your brain. Come on, everyone should know that?

0

8

1

0

8K

8 days ago

@witcheer Won't hot air from the AIO liquid CPU cooler be sucked right into the RTX 5090? Maybe it'll be fine if you don't run sustained 575/600 watt loads.

2

1

0

0

204

12 days ago

@TheRussiaBlog @cuggxa Should be something like Добрую мысль ты сказал**** Hello from Toronto!

2

2

0

0

29

16 days ago

@JoesInvestments If it's the regular 2 x 3080 (10GB each), you have 20GB of VRAM, which is not quite enough for this quant + MTP. You would probably need to try one of the q3 versions to fit everything in. If you have the 3080 12GB, then this should work exactly the same.

1

1

0

0

86
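The VRAM arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration, not the author's tooling: the per-card figures are the cards' published memory sizes, and the 24GB threshold is an assumption taken from the pinned setup (2 x 3080 Ti), since the actual footprint of the quant + MTP is not stated in the thread.

```python
# Published VRAM per card, in GB.
CARD_VRAM_GB = {
    "RTX 3080 10GB": 10,
    "RTX 3080 12GB": 12,
    "RTX 3080 Ti": 12,
    "RTX 3090": 24,
}

# Assumption: the pinned setup (2 x 3080 Ti = 24 GB) is treated as the
# amount of VRAM needed; the real memory footprint is not given in the thread.
REQUIRED_GB = 24


def total_vram_gb(card: str, count: int) -> int:
    """Combined VRAM across `count` identical cards."""
    return CARD_VRAM_GB[card] * count


def fits(card: str, count: int, required_gb: int = REQUIRED_GB) -> bool:
    """True if the combined VRAM meets the assumed requirement."""
    return total_vram_gb(card, count) >= required_gb


if __name__ == "__main__":
    setups = [
        ("RTX 3080 10GB", 2),
        ("RTX 3080 12GB", 2),
        ("RTX 3080 Ti", 2),
        ("RTX 3090", 1),
    ]
    for card, count in setups:
        total = total_vram_gb(card, count)
        verdict = "fits" if fits(card, count) else "too small, try a smaller quant"
        print(f"{count} x {card}: {total} GB -> {verdict}")
```

Combined capacity is only a rough guide here: with layers split across two cards, each card also needs room for its own share of the KV cache and compute buffers.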

16 days ago

@idare split layers

0

0

0

0

25

17 days ago

@1337hero Cable management has never been my strong suit, lol

0

1

0

0

88

17 days ago

If you have another agent like Claude Code running from the cloud, you could ask it to inspect the logs to see what's happening. It could probably find the cause. Do you track the temperature of your 3090s? They might be overheating. For me, this usually only happens when I run out of memory (OOM).

0

1

0

0

74

17 days ago

@EvilOni Nice, I did see them on eBay recently. How much did you get them for?

1

0

0

0

71

17 days ago

It looks like your settings are good. I guess q5_k_xl is slower in generation than q4_k_m. You could try --spec-draft-n-max 3 instead of 2, but I don't think it will make that much difference. I haven't tested 2 x 5060 Ti on 27B myself. I did test a more compressed Qwen3.6-27B IQ3_K_R4 with no MTP on a single 5060 Ti and got about 28 t/s, which I managed to increase to 33-34 with aggressive memory overclocking: https://t.co/0wGfoTk8LE

2

1

0

0

71

17 days ago

@MakJoris @therealazzurro @therealazzurro is getting that with 2 x 5060 Ti. But we don't know his exact setup, so perhaps there is room for improvement. Your 37 t/s is extremely good for a 5060 Ti! I got 60 on 2 x 3080 Ti, but it does go down to about 40 at longer context.

1

0

0

0

42

17 days ago

@MakJoris @therealazzurro Are you using ik_llama.cpp or mainline llama.cpp?

0

0

0

0

26

17 days ago

@ItsmeAjayKV @threejs They are cool, they liked my post when I posted a coding example using three.js! 👍

0

2

0

0

138

17 days ago

@Hihiohoo No, just the regular 12GB versions

0

0

0

0

40

17 days ago

@Gianluca_Bing Of course

0

0

0

0

99

17 days ago

@therealazzurro You can probably squeeze more tokens per second out of 2 x 5060 Ti 16GB. Q5_k_xl gives you better quality for sure. Check the comment above yours: 2 x 9060XT 16GB doing 74 t/s.

0

0

0

1

186

17 days ago

@i_okk Amazing! Dual 9060XT must be the most affordable way to get 32GB of VRAM!

1

0

0

1

188

17 days ago

@NVenetias Are you running the same quant for the model? I am also using q4 compression for the KV cache here. With 48GB of VRAM you should have more than enough to run the full 262k context.

1

0

0

0

133

17 days ago

@callegariai Yes, you can do the same on a single 3090 with less power consumption

0

0

0

0

207

17 days ago

@ProofOfPrints Haven't tested it. Of course Claude Opus will be superior, but yeah, it would be interesting to try Claude for planning and local Qwen for execution!

1

1

0

0

235
