Great question — and you’re right that this is the failure mode that matters: top-1 accuracy can match while the token distribution drifts, especially on long generations.
Reference: we measured against the FP8 baseline (zai-org/GLM-5.2-FP8) on the same vLLM harness (same image, TP, --kv-cache-dtype fp8, and the glm45/glm47 reasoning+tool parsers), so it’s like-for-like serving. FP8 is our higher-precision reference: 2× the bit-width of the 4-bit weights, and the production-standard deployment. Honest caveat: we did not run a live BF16 reference (BF16 is ~1.49 TB and needs all 8 H200s just to host one replica), so the reference is FP8, not FP16/BF16.
On the “passes quick evals but drifts on long outputs” point specifically, we deliberately didn’t stop at short-answer accuracy:
• Long-horizon agentic: SWE-bench Verified (mini-swe-agent + official swebench.harness grading), scored on full multi-turn trajectories, not single answers. 410/500 = 82.0% vs FP8 411/500 = 82.2% (Δ = one problem). This is the long-output axis that usually exposes drift.
• Long context: RULER @ 32K / 64K (0.832 / 0.841 vs FP8 0.831 / 0.813), plus a needle retrieved from a ~936K-token prompt when serving with a 1M-token context window.
• Full CoT budgets: all reasoning generations ran at max_gen_toks=16384, not truncated. We actually got bitten by the default truncation first (bogus gsm8k 0.14 / math500 0.03), which is exactly the “looked fine on a quick eval” trap, and fixed it before trusting any number.
• Token-distribution gate: next-token KL-divergence + flip-rate vs the FP8 baseline (from /v1/completions logprobs), built on “Accuracy is Not All You Need” (NeurIPS 2024), since two models can match on accuracy while their distributions diverge.
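To make the token-distribution gate concrete, here is a minimal sketch of the two statistics. It is not the harness code behind the numbers above. It assumes the top-k logprobs at each position have already been collected from /v1/completions for both models, scored on the same token prefix, and stored as token -> logprob dicts. The function names, the floor_logprob value used for tokens missing from one model's top-k, and the per-token reading of "flip" (top-1 disagreement) are all assumptions for illustration; the paper itself defines flips over answers rather than tokens.

```python
import math


def _normalize(logprobs):
    """Turn a token -> logprob dict into token -> probability summing to 1."""
    weights = {tok: math.exp(lp) for tok, lp in logprobs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def kl_and_flip_rate(ref_steps, test_steps, floor_logprob=-20.0):
    """Mean KL(ref || test) and top-1 flip rate over aligned positions.

    ref_steps / test_steps: one dict per position, mapping token -> logprob
    (hypothetical layout; adapt to however the logprobs were stored).
    floor_logprob: assumed logprob for a token that only one model returned.
    """
    if not ref_steps or len(ref_steps) != len(test_steps):
        raise ValueError("need two non-empty, equally long position lists")

    kl_sum = 0.0
    flips = 0
    for ref, test in zip(ref_steps, test_steps):
        # Compare both models over the union of their top-k tokens.
        support = set(ref) | set(test)
        p = _normalize({t: ref.get(t, floor_logprob) for t in support})
        q = _normalize({t: test.get(t, floor_logprob) for t in support})
        kl_sum += sum(p[t] * math.log(p[t] / q[t]) for t in support)
        # A "flip" here means the two models disagree on the most likely token.
        if max(ref, key=ref.get) != max(test, key=test.get):
            flips += 1

    n = len(ref_steps)
    return kl_sum / n, flips / n
```

Because only the top-k tokens are visible, the KL here is an approximation over a truncated support; a larger k or a different floor will shift the absolute value, so it is most useful as a relative comparison between quantized variants measured the same way.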
So the short-answer evals were the floor, not the verdict — SWE-bench + RULER + the KL/flip gate are what the “matches FP8” claim rests on. Other honest limits: RULER ran at --limit 50/subtask, and the W4A16 mmlu_pro run was cut once the verdict was clear (FP8 full = 0.820 is the reference). If you’ve got a long-output probe that caught drift for you, I’d genuinely like to hear what it was — always looking to harden the gate.
(For visibility: our 8× RTX PRO 6000 (Blackwell) benchmarks are in progress and we’ll post those numbers soon.)
We quantized GLM-5.2 (744B MoE) to 4-bit — and kept its MTP draft head in BF16.
→ Matches the FP8 release on quality
→ Runs on 4×H200 instead of 8
→ Fastest 4-bit GLM-5.2 at interactive concurrency: +69–79% vs AWQ / NVFP4 at batch-1, from MTP speculative decoding
👇
https://t.co/QunrvTrmfb