An M720q (i3-9100T) with dual NVLinked 2080Ti 22G cards costs around $600 and might be the only server you’ll need for Qwen27B inference: it peaks at 110 tk/s at MTP3 with FP8 weights and 256K ctx (tqk8v4) https://t.co/nx7u93LjiF
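A back-of-the-envelope sketch of why FP8 weights fit on this pair of cards. It assumes exactly 27 billion parameters at one byte each and treats "22G" as 22 GB per card. It ignores activations, runtime overhead, and any layers kept at higher precision, so the real headroom for the KV cache will be smaller.

```python
def weight_budget_gb(params_billion, bytes_per_param, num_gpus, gb_per_gpu):
    """Return (weights_gb, remaining_gb) for a simple memory estimate.

    Assumption: 1 GB = 1e9 bytes, so billions of params * bytes per param = GB.
    """
    weights_gb = params_billion * bytes_per_param
    total_gb = num_gpus * gb_per_gpu
    return weights_gb, total_gb - weights_gb


# 27B parameters in FP8 (1 byte each) on two 22 GB cards.
weights, remaining = weight_budget_gb(27, 1.0, 2, 22)
print(f"weights: {weights:.0f} GB, left for KV cache and overhead: {remaining:.0f} GB")
```

Under these assumptions the weights take about 27 GB of the 44 GB total, which leaves roughly 17 GB for the KV cache and everything else.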
Qwen-AgentWorld 35B-A3B is by far the most advanced 30B-level MoE model for real agentic tasks. Even though it is derived from Qwen3.5 35B-A3B, it is almost on par with 3.6 27B and simply surpasses 3.6 35B-A3B in every aspect.
@sakurayukiai @davideciffa We have a sophisticated IPC workaround for the interconnect issue, and the latency between backends is really small (even on my 15-year-old Westmere Xeon), so now we can put any layer shard onto any GPU we want. 😎
@fantopy_kai @davideciffa We have a very sophisticated IPC workaround for the interconnect issue, and the latency between backends is really small even with a low-end CPU. Feel free to try it and share feedback 😀
Thanks to @maxweicj, the Lucebox speculative inference engine now supports using Luce DFlash and DDTree on mixed backends with AMD and Nvidia cards linked together 🏎️
#vLLM_2080Ti_Definitive_Edition is ready to go: enjoy single-request 100+ tok/s on 27B/31B dense models at 1/8 the cost of an RTX 5090. Full feature support for Qwen3.6 27B (Gemma4 31B as an experimental path). Check https://t.co/Db2KVMwgMQ to boost your RTX 2080Ti now #llm #vllm
@featuringjared @rumgewieselt Not recommended: the P100 does not support the crucial IDP4A instruction that other Pascal cards do. The P40/P10 would be a better choice if you do want to try sm61 cards.
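A minimal sketch of that check, assuming the usual rule that DP4A needs compute capability 6.1 or higher. The card table is a small illustrative sample, and `has_dp4a` is a hypothetical helper name, not part of any library.

```python
# DP4A (4-way int8 dot product with int32 accumulate) needs sm_61 or newer.
DP4A_MIN_CC = (6, 1)

# Illustrative sample of cards and their compute capabilities.
EXAMPLE_CARDS = {
    "Tesla P100": (6, 0),   # sm60: no DP4A
    "Tesla P40": (6, 1),    # sm61: DP4A available
    "RTX 2080 Ti": (7, 5),  # sm75: DP4A available
}


def has_dp4a(compute_capability):
    """Return True if a (major, minor) compute capability supports DP4A."""
    return tuple(compute_capability) >= DP4A_MIN_CC


for name, cc in EXAMPLE_CARDS.items():
    status = "ok" if has_dp4a(cc) else "no DP4A"
    print(f"{name} (sm{cc[0]}{cc[1]}): {status}")
```

On a machine with PyTorch installed, the tuple returned by `torch.cuda.get_device_capability()` can be passed to `has_dp4a` directly.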
PP 1841.7 tk/s | TG 101.3 tk/s | Context 735K
2 x #2080Ti 22GB, NVLinked, run Qwen3.6-27B-AWQ through vLLM (TP=2, MTP K=3, KV=tq4nc) with extraordinary single-request performance! This maximizes the AI value of a $500 legacy setup.
https://t.co/7DzqWsxUZG
#localLLM @Alibaba_Qwen @vllm_project
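For a rough sense of what the PP and TG rates mean for a single request, here is a small sketch. It assumes both rates stay constant, although in practice they usually drop as the context grows. The 32,000-token prompt and 1,000-token reply are hypothetical sizes, not figures from the post.

```python
PP_TOK_PER_S = 1841.7  # prompt processing (prefill) rate quoted in the post
TG_TOK_PER_S = 101.3   # token generation (decode) rate quoted in the post


def request_seconds(prompt_tokens, output_tokens,
                    pp=PP_TOK_PER_S, tg=TG_TOK_PER_S):
    """Return (prefill_s, decode_s), assuming constant throughput."""
    return prompt_tokens / pp, output_tokens / tg


# Hypothetical request: 32,000-token prompt, 1,000-token reply.
prefill_s, decode_s = request_seconds(32_000, 1_000)
print(f"prefill ~{prefill_s:.1f} s, decode ~{decode_s:.1f} s")
```

At these assumed sizes the prefill takes somewhat longer than the decode, so very long prompts are dominated by the PP rate.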