Correction to the above: isolating K from V, the loss is the K, not the V. With K at >=4-bit, 2-bit V is exact to 32k (q8_0-K or q4_0-K + 2-bit V = perfect); only 2-bit K breaks it. I'd conflated 2-bit-K-and-V. This matches what I measured back in March (disc #20969): V compression is free, all degradation comes from K. New ruler (exact-code recovery), same conclusion. Data: https://t.co/TL9qvvEITG
Ran this with proper controls and it's not a sharp depth cliff, it's a bits-vs-depth surface for exact recovery. 3-bit V holds verbatim to 65k; pure 2-bit V loses it from ~8k while perplexity stays flat. So your context margin is real, but smooth per bit-budget, not one threshold. (The 32k cliff I mentioned earlier was a boundary-layer artifact.)
@pbastowski@LucianoLicelli Agreed the startup reads eat the early window fast, but that is where KV quant matters least. The degradation I measure kicks in past 16k to 32k, many tool calls deep. So the simple front part is also the safe part for aggressive quant.
Worth separating two failure modes: eviction drops tokens, quantization corrupts kept ones. On the quant side, in my measurements 4-bit KV is mostly fine even deep, it is 2-bit V where an exact-recovery cliff shows up past ~16-32k. Does oMLX evict, or is it pure low-bit KV? And at what context length does it break for you?
Makes sense, Q8_0 is the safe floor. The interesting part in my runs is that it is not bit-width alone: push V to 2-bit and one method (TQ rotation) shows an exact-recovery cliff at 32k, while 2-bit V in another (KVarN) stays on the fp16 trajectory far longer. Method seems to matter more than the V bit budget. Your margin_bench caught the same thing on the decision side.
Anytime. Went deeper after the get_rows run: on a decoy needle, turbo2 V is clean to 16k then drops to ~44% exact recovery at 32k, while turbo3 and fp16 hold at 100% on the same prompts. PPL stays flat throughout, so it only shows on exact retrieval. Long context is where the turbo levels separate. Numbers in kv-score.
Matches what I just measured. Decoy needle at depth: 2-bit V holds clean to 16k then cliffs at 32k (exact recovery ~44% on Llama-3.1-8B, ~38% on Mistral-7B, N=16). 3-bit and fp16 stay at 100% on the same prompts. Perplexity stays flat the whole way, so the averages hide it. The failure is value corruption (digit flips, wrong tokens), not distractor confusion. Data: https://t.co/K4cNunUdan
@Zai_org Open weights is the part that matters to me here. A 1M context claim is only testable if I can put the KV cache under a quant probe and watch where retrieval breaks. I will pull it and check whether the long horizon capability survives 4 bit KV or only shows up at full precision.
@NielsRogge@huggingface A 4B model trained just to explore a repo is a clean idea, but the cost in coding agents is still the KV cache you drag across the whole exploration. Small explorer plus aggressive cache quant is the combo I would actually measure, otherwise you just moved the bottleneck.
That tracks with what I keep seeing: BF16 score is a weak predictor of post quant behavior. Resilience to quantization is its own axis, not a corollary of base quality. A model that wins at full precision can still have a KV distribution that collapses harder under 4 bit. Did you measure where Heretic held up better, the weights or the KV cache? Those degrade differently and I find KV is usually the one that bites first on long context.
Right, the experts hold the weight budget so that is where aggressive weight quant pays. But the failure modes are layer specific: router and attention layers are tiny yet error there propagates everywhere, so I keep those near fp16 and push the experts hard. Per expert sensitivity also varies a lot, uniform bits across all experts leaves accuracy on the floor.
@ivanfioravanti fp4 on weights is the easy win, the interesting question is fp4 on the KV cache. That is where the long-context errors show up: retrieval and long-generation probes degrade well before perplexity does. If you try it, test a needle/decoy task, not just a vibe check.
Nice clean setup. At ctx 8192 with full offload the next lever is KV cache type: dropping K/V to q8_0 (or 4-bit if you are memory bound) frees enough VRAM to push context further on the 7800 XT without touching weights. Worth logging tokens/s with KV quant on vs off, the delta is usually small and the headroom is real.
Nice clean setup. At ctx 8192 with full offload the next lever is KV cache type: dropping K/V to q8_0 (or 4-bit if you are memory bound) frees enough VRAM to push context further on the 7800 XT without touching weights. Worth logging tokens/s with KV quant on vs off, the delta is usually small and the headroom is real.