@zhyncs42 Congrats! Would you guys share the right config to run Qwen3.6 models on B300s? Would be nice to have some recipes; I couldn't find enough info on GitHub.
@EmrickSini It actually uses a vLLM plugin, which is why it works by just setting an env variable. It's also easy to add to a Docker image: https://t.co/zgU1qlzgJC
https://t.co/orcApi6B7J
You can now easily load various pre-quantized models (block FP8, NVFP4, AWQ/GPTQ/HQQ, certain GGUFs) via a vLLM plugin! You can also run on-the-fly quantization, and it's easy to use: just 1 or 2 flags to enable it!
There's a very simple one-shot trick that seems to improve the quality of low-bit weight quantization by quite a bit in some cases: simply reordering the rows.
It doesn't require changing the matmul kernel, only reshuffling the activations.
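A minimal sketch of the idea, in plain Python. The tweet doesn't say how the row order is chosen, so sorting rows by mean absolute value is an assumption here, as are the toy weights (built so the effect is easy to see), the 4-bit symmetric round-to-nearest scheme, and the helper names `quantize_groupwise`, `sq_error`, and `matvec`. Weights are stored as K rows by N columns with y = x @ W, and each group of consecutive rows in a column shares one scale.

```python
def quantize_groupwise(W, group_size, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization.

    Each column is split into groups of `group_size` consecutive rows,
    and every group gets its own scale.
    """
    K, N = len(W), len(W[0])
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric
    out = [[0.0] * N for _ in range(K)]
    for j in range(N):
        for start in range(0, K, group_size):
            rows = range(start, min(start + group_size, K))
            amax = max(abs(W[i][j]) for i in rows)
            scale = amax / qmax if amax > 0 else 1.0
            for i in rows:
                q = max(-qmax, min(qmax, round(W[i][j] / scale)))
                out[i][j] = q * scale
    return out


def sq_error(A, B):
    """Sum of squared element-wise differences between two matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def matvec(x, W):
    """y = x @ W for a length-K vector x and a K x N matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]


# Toy weights (hypothetical): large and small rows interleaved, so every
# group of 4 consecutive rows mixes two very different magnitudes.
base = [[7.0, -3.0], [5.0, 7.0], [-7.0, 2.0], [1.0, -7.0]]
W = []
for row in base:
    W.append(row)                      # large-magnitude row
    W.append([0.01 * v for v in row])  # small-magnitude row
x = [0.5, -1.0, 2.0, 3.0, -0.25, 1.5, 1.0, -2.0]

# Baseline: quantize in the original row order.
W_plain = quantize_groupwise(W, group_size=4)
err_plain = sq_error(W, W_plain)

# Reordered: sort rows by mean magnitude (assumed heuristic) so that rows
# with similar scale land in the same group, then permute x the same way.
order = sorted(range(len(W)), key=lambda i: sum(abs(v) for v in W[i]))
W_perm = [W[i] for i in order]
x_perm = [x[i] for i in order]
W_sorted = quantize_groupwise(W_perm, group_size=4)
err_sorted = sq_error(W_perm, W_sorted)

y_ref = matvec(x, W)                # full-precision output
y_plain = matvec(x, W_plain)        # quantized, original order
y_sorted = matvec(x_perm, W_sorted) # quantized, reordered rows and activations
```

In this toy case the small rows are rounded to zero in the original order because they share a scale with large rows, while the reordered version gives them their own groups. The row permutation of W can be applied once offline, so at inference time only x needs to be reshuffled and the matmul itself is untouched.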
What kind of FP4 format does the new TPU8 use? MXFP4 quality is pretty poor, and NVFP4 is specific to Nvidia, so I'm guessing it uses a smaller group size (<32) to achieve better quality?
@elliotarledge Did you run it end-to-end in the full vLLM/SGLang stack? You'll only see the real perf issues when you do that, not when you benchmark the kernel in isolation.
Open sourcing something fun from @Dropbox: Witchcraft.
It's a local search engine built in Rust with no API keys or vector DB required.
Think: ColBERT / late interaction style retrieval, but packaged to run locally (perfect for coding agents).
Let's dive in.
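For readers unfamiliar with late interaction, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It is not Witchcraft's code: the function name `maxsim_score`, the 2-dimensional token embeddings, and the document names are all hypothetical, chosen only to show how the score is formed.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take the best dot
    product against any document token, then sum over query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


# Hypothetical per-token embeddings (one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "doc_b": [[0.7, 0.7], [0.6, 0.0]],
}

# Rank documents by their MaxSim score, best first.
scores = {name: maxsim_score(query, vecs) for name, vecs in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike single-vector retrieval, each document keeps one embedding per token, and matching happens token by token at query time.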