Felipe Sztutman @sztlink - Twitter Profile

about 21 hours ago

Public Dream Derivation — 2026-06-24 Three images from a closed dream. The report stays private. The residue is public. omissions preserved

omissions preserved https://t.co/iecbbD5imL

0

5

Felipe Sztutman

@sztlink

2 days ago

Public Dream Derivation — 2026-06-23 Three images from a closed dream. The report stays private. The residue is public. omissions preserved

omissions preserved https://t.co/JHwq1PZVdy

0

5

Felipe Sztutman

@sztlink

3 days ago

Public Dream Derivation — 2026-06-22 Three images from a closed dream. The report stays private. The residue is public. omissions preserved

omissions preserved https://t.co/nNVbXDSfST

0

9

Felipe Sztutman

@sztlink

4 days ago

Public Dream Derivation — 2026-06-21 Three images from a closed dream. The report stays private. The residue is public. omissions preserved

omissions preserved https://t.co/qbdxI6J8Sd

0

6

Felipe Sztutman

@sztlink

4 days ago

Correction to the above: isolating K from V, the loss is the K, not the V. With K at >=4-bit, 2-bit V is exact to 32k (q8_0-K or q4_0-K + 2-bit V = perfect); only 2-bit K breaks it. I'd conflated 2-bit-K-and-V. This matches what I measured back in March (disc #20969): V compression is free, all degradation comes from K. New ruler (exact-code recovery), same conclusion. Data: https://t.co/TL9qvvEITG

0
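
A minimal sketch of the isolation design this correction describes: vary K alone, V alone, then both, against an f16 baseline, so that a loss can be attributed to one side. The function name `isolation_grid` and the label `"2bit"` are placeholders for illustration, not identifiers taken from the linked data or from llama.cpp.

```python
from itertools import product


def isolation_grid(k_types, v_types, baseline="f16"):
    """List (k_type, v_type) runs: baseline, K only, V only, then both."""
    k_only = [(k, baseline) for k in k_types if k != baseline]
    v_only = [(baseline, v) for v in v_types if v != baseline]
    both = [(k, v) for k, v in product(k_types, v_types)
            if k != baseline and v != baseline]
    return [(baseline, baseline)] + k_only + v_only + both


if __name__ == "__main__":
    for k_type, v_type in isolation_grid(["q8_0", "q4_0", "2bit"], ["2bit"]):
        print(f"K={k_type:5s} V={v_type}")
```

In this layout the pairs ("q8_0", "2bit") and ("q4_0", "2bit") correspond to the runs the tweet reports as exact to 32k, and ("2bit", "2bit") is the conflated case.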

Felipe Sztutman

@sztlink

4 days ago

Ran this with proper controls and it's not a sharp depth cliff, it's a bits-vs-depth surface for exact recovery. 3-bit V holds verbatim to 65k; pure 2-bit V loses it from ~8k while perplexity stays flat. So your context margin is real, but smooth per bit-budget, not one threshold. (The 32k cliff I mentioned earlier was a boundary-layer artifact.)

1

0

1
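
One way to tabulate a bits-vs-depth surface like the one described above, as a sketch only. It assumes each trial is recorded as a `(bits, depth, exact)` tuple; that record format, the function names, and the sample trials are all made up for illustration and are not the thread's data.

```python
from collections import defaultdict


def recovery_surface(trials):
    """Map (bits, depth) -> fraction of trials with exact recovery."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for bits, depth, exact in trials:
        total[(bits, depth)] += 1
        hits[(bits, depth)] += 1 if exact else 0
    return {cell: hits[cell] / total[cell] for cell in total}


def deepest_exact(surface, bits, threshold=1.0):
    """Largest depth up to which every measured depth meets the threshold."""
    best = None
    for depth in sorted(d for b, d in surface if b == bits):
        if surface[(bits, depth)] < threshold:
            break
        best = depth
    return best


if __name__ == "__main__":
    # Made-up placeholder trials, NOT measurements from the thread.
    trials = [(3, 4096, True), (3, 8192, True),
              (2, 4096, True), (2, 8192, True), (2, 8192, False)]
    surface = recovery_surface(trials)
    for bits in (2, 3):
        print(bits, "bit V: exact through depth", deepest_exact(surface, bits))
```

Reading the result per bit-width, rather than looking for one global threshold, is what makes the "smooth per bit-budget" shape visible.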

Felipe Sztutman

@sztlink

6 days ago

@pbastowski @LucianoLicelli Agreed the startup reads eat the early window fast, but that is where KV quant matters least. The degradation I measure kicks in past 16k to 32k, many tool calls deep. So the simple front part is also the safe part for aggressive quant.

1

0

10

Felipe Sztutman

@sztlink

6 days ago

Worth separating two failure modes: eviction drops tokens, quantization corrupts kept ones. On the quant side, in my measurements 4-bit KV is mostly fine even deep, it is 2-bit V where an exact-recovery cliff shows up past ~16-32k. Does oMLX evict, or is it pure low-bit KV? And at what context length does it break for you?

1

0

47
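
The two failure modes separated above can be shown on a toy cache. This is a sketch under loud assumptions: each cached vector is reduced to a single float, eviction is "drop the oldest", and quantization is plain uniform rounding. None of this is how oMLX or llama.cpp implement either one.

```python
def evict_oldest(cache, keep):
    """Eviction: keep only the most recent `keep` entries; the rest are gone."""
    return cache[max(0, len(cache) - keep):]


def quantize_values(cache, bits):
    """Quantization: keep every entry, but snap values to 2**bits levels."""
    values = [v for _, v in cache]
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(cache)
    step = (hi - lo) / (2 ** bits - 1)
    return [(pos, lo + round((v - lo) / step) * step) for pos, v in cache]


if __name__ == "__main__":
    # Toy cache: (token position, one scalar standing in for a cached vector).
    cache = [(pos, 0.1 * pos) for pos in range(8)]
    print("evicted  :", [pos for pos, _ in evict_oldest(cache, 4)])
    print("quantized:", [round(v, 2) for _, v in quantize_values(cache, 2)])
```

After eviction some positions no longer exist; after quantization every position is still present but its value has moved, which is the "corrupts kept ones" case.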

Felipe Sztutman

@sztlink

6 days ago

Makes sense, Q8_0 is the safe floor. The interesting part in my runs is that it is not bit-width alone: push V to 2-bit and one method (TQ rotation) shows an exact-recovery cliff at 32k, while 2-bit V in another (KVarN) stays on the fp16 trajectory far longer. Method seems to matter more than the V bit budget. Your margin_bench caught the same thing on the decision side.

0

1

0

9

Felipe Sztutman

@sztlink

about 1 month ago

Small local GGUF datapoint vs @spiritbuun's AIME ablation run:
RTX 3090 / llama.cpp / Q4_K_M / AIME 2026 full set
reasoning off, 4096: stock 17/30, Huihui 15/30
reasoning on, 30k: stock 18/30, Huihui 15/30
Not a BF16 reproduction, just 24GB GGUF context.

1

0

240

Felipe Sztutman

@sztlink

7 days ago

Anytime. Went deeper after the get_rows run: on a decoy needle, turbo2 V is clean to 16k then drops to ~44% exact recovery at 32k, while turbo3 and fp16 hold at 100% on the same prompts. PPL stays flat throughout, so it only shows on exact retrieval. Long context is where the turbo levels separate. Numbers in kv-score.

0

1

0

14

Felipe Sztutman

@sztlink

7 days ago

Matches what I just measured. Decoy needle at depth: 2-bit V holds clean to 16k then cliffs at 32k (exact recovery ~44% on Llama-3.1-8B, ~38% on Mistral-7B, N=16). 3-bit and fp16 stay at 100% on the same prompts. Perplexity stays flat the whole way, so the averages hide it. The failure is value corruption (digit flips, wrong tokens), not distractor confusion. Data: https://t.co/K4cNunUdan

0

13
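
A small sketch of how answers could be scored to tell value corruption apart from distractor confusion, as the tweet above does. The substring-matching rule, the function names, and the sample codes are assumptions for illustration; the linked data may use a different scoring rule.

```python
def classify(answer, needle, decoys):
    """Label one model answer against the needle and its decoys."""
    if needle in answer:
        return "exact"
    if any(decoy in answer for decoy in decoys):
        return "distractor"   # the model picked up a decoy
    return "corrupted"        # neither needle nor decoy, e.g. a flipped digit


def exact_recovery_rate(labels):
    """Fraction of trials labelled "exact"."""
    return sum(label == "exact" for label in labels) / len(labels)


if __name__ == "__main__":
    # Hypothetical answers for illustration only.
    needle, decoys = "482913", ["771204", "905516"]
    answers = ["The code is 482913.", "482918", "771204"]
    labels = [classify(a, needle, decoys) for a in answers]
    print(labels, exact_recovery_rate(labels))
```

With N=16 prompts, each trial moves the exact-recovery rate by 6.25 percentage points, so small differences between models at that sample size are coarse.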

Felipe Sztutman

@sztlink

8 days ago

Public Dream Derivation — 2026-06-17 Three images from a closed dream. The report stays private. The residue is public. omissions preserved

sztlink's tweet photo. Public Dream Derivation — 2026-06-17

Three images from a closed dream.
The report stays private.
The residue is public.

omissions preserved https://t.co/sIxvSyatqi

0

33

Felipe Sztutman

@sztlink

8 days ago

@Zai_org Open weights is the part that matters to me here. A 1M context claim is only testable if I can put the KV cache under a quant probe and watch where retrieval breaks. I will pull it and check whether the long-horizon capability survives 4-bit KV or only shows up at full precision.

0

1

0

1K

Felipe Sztutman

@sztlink

8 days ago

@NielsRogge @huggingface A 4B model trained just to explore a repo is a clean idea, but the cost in coding agents is still the KV cache you drag across the whole exploration. Small explorer plus aggressive cache quant is the combo I would actually measure, otherwise you just moved the bottleneck.

0

65

Felipe Sztutman

@sztlink

8 days ago

That tracks with what I keep seeing: BF16 score is a weak predictor of post-quant behavior. Resilience to quantization is its own axis, not a corollary of base quality. A model that wins at full precision can still have a KV distribution that collapses harder under 4-bit. Did you measure where Heretic held up better, the weights or the KV cache? Those degrade differently, and I find KV is usually the one that bites first on long context.

1

0

17

Felipe Sztutman

@sztlink

8 days ago

Right, the experts hold the weight budget so that is where aggressive weight quant pays. But the failure modes are layer specific: router and attention layers are tiny yet error there propagates everywhere, so I keep those near fp16 and push the experts hard. Per expert sensitivity also varies a lot, uniform bits across all experts leaves accuracy on the floor.

0

1

0

28
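
To make the budget argument above concrete, here is a sketch that computes the parameter-weighted average bit-width of a mixed allocation. The group names, parameter shares, and bit-widths are hypothetical and do not describe any specific model or the author's actual recipe.

```python
def average_bits(groups):
    """Parameter-weighted mean bits per weight across tensor groups."""
    total_params = sum(n for n, _ in groups.values())
    return sum(n * bits for n, bits in groups.values()) / total_params


if __name__ == "__main__":
    # Hypothetical parameter shares (in percent) and bit-widths.
    mixed = {"experts": (90, 3), "attention": (8, 16), "router": (2, 16)}
    print(f"{average_bits(mixed):.2f} bits per weight on average")
```

Because the experts dominate the parameter count in this made-up split, keeping the small router and attention groups at 16 bits adds little to the average.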

Felipe Sztutman

@sztlink

8 days ago

@ivanfioravanti fp4 on weights is the easy win, the interesting question is fp4 on the KV cache. That is where the long-context errors show up: retrieval and long-generation probes degrade well before perplexity does. If you try it, test a needle/decoy task, not just a vibe check.

1

2

0

21
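
A needle/decoy probe of the kind suggested above could be assembled roughly like this. It is a sketch: the sentence templates, the `build_probe` name, and the placement rule are assumptions, not the construction used in the author's own benchmark.

```python
import random


def build_probe(filler, needle_code, decoy_codes, depth, seed=0):
    """Place one needle at a relative depth and scatter look-alike decoys."""
    rng = random.Random(seed)
    lines = list(filler)
    # depth is a fraction in [0, 1]: 0 is the start of the context, 1 the end.
    lines.insert(int(depth * len(lines)), f"The vault code is {needle_code}.")
    for code in decoy_codes:
        # Decoys share the needle's format but name a different object.
        lines.insert(rng.randrange(len(lines) + 1), f"The locker code is {code}.")
    question = "What is the vault code? Answer with the digits only."
    return "\n".join(lines) + "\n\n" + question


if __name__ == "__main__":
    filler = [f"Filler sentence {i}." for i in range(20)]
    print(build_probe(filler, "482913", ["771204", "905516"], depth=0.5))
```

Asking for the digits only keeps scoring to an exact string check, which is the signal that degrades before perplexity does.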

Felipe Sztutman

@sztlink

8 days ago

Nice clean setup. At ctx 8192 with full offload the next lever is KV cache type: dropping K/V to q8_0 (or 4-bit if you are memory bound) frees enough VRAM to push context further on the 7800 XT without touching weights. Worth logging tokens/s with KV quant on vs off, the delta is usually small and the headroom is real.

0

1

0

23
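
A back-of-envelope sketch of the headroom mentioned above. The tweet does not name the model, so the shape used here (Llama-3.1-8B: 32 layers, 8 KV heads, head dimension 128) is an assumption, and the result ignores any allocator overhead. In llama.cpp the cache types are selected with `--cache-type-k` and `--cache-type-v`.

```python
# Bytes per cached element. f16 is 2 bytes; ggml's q8_0 and q4_0 store
# blocks of 32 elements in 34 and 18 bytes respectively (values plus an
# fp16 scale per block).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    elements_per_token = n_layers * n_kv_heads * head_dim
    per_element = BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    return n_ctx * elements_per_token * per_element


if __name__ == "__main__":
    # Assumed shape: Llama-3.1-8B (32 layers, 8 KV heads, head dim 128).
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(**shape, type_k=cache_type, type_v=cache_type)
        print(f"{cache_type:5s} {size / 2**20:7.0f} MiB")
```

Under these assumptions the cache at ctx 8192 shrinks from 1024 MiB at f16 to 544 MiB at q8_0 and 288 MiB at q4_0, and the saving scales linearly with context length.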
