@antirez I think you could build a model quantized to a size similar to iq4xs. That sweet spot stays clear of the range around 3-bit quantization, where capability drops off rapidly, and it is still small enough to fit the model in memory.
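A rough sketch of the size arithmetic behind that trade-off. The 100B parameter count is a hypothetical placeholder, and the bits-per-weight figures are assumptions for illustration (about 4.25 is the figure commonly quoted for llama.cpp's IQ4_XS format); nothing here is measured on a real model.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1e9 bytes).

    Ignores KV cache, activations, and per-file metadata overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    N_PARAMS = 100e9  # hypothetical model size, not taken from the thread
    # Bits-per-weight values are illustrative assumptions.
    for label, bpw in [("2-bit", 2.0), ("3-bit", 3.0), ("iq4xs-like", 4.25)]:
        print(f"{label:>10}: {quantized_size_gb(N_PARAMS, bpw):6.1f} GB")
```

The point of the comparison is only that each extra bit per weight costs a fixed amount of memory, so the choice comes down to how much quality the extra gigabytes buy back.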
That's why people using DS4F with DwarfStart, 2-bit quantized, are often surprised by the results. It's not a frontier model, but it is not a toy either: it is something you can actually use to get work done, and nobody can tell you what to do with it.
@antirez @matteocollina @openclaw For anyone wondering why there's such a massive gap between prefill and gen speeds, it's the classic compute-bound vs memory-bound split: prefill is limited by compute, generation by memory bandwidth. Broke down the mechanics here: https://t.co/BoymHlpBmi
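A back-of-the-envelope sketch of that split (not the linked breakdown, just the standard roofline-style estimate). It assumes generation must read every active weight once per token, and that prefill costs roughly 2 FLOPs per active weight per token. All hardware and model numbers below are hypothetical, and the results are upper bounds that ignore attention, KV-cache traffic, and kernel overhead.

```python
def decode_tokens_per_s(active_params: float, bits_per_weight: float,
                        mem_bw_bytes_per_s: float) -> float:
    """Memory-bound ceiling: each generated token streams all active weights."""
    bytes_per_token = active_params * bits_per_weight / 8
    return mem_bw_bytes_per_s / bytes_per_token


def prefill_tokens_per_s(active_params: float, flops_per_s: float) -> float:
    """Compute-bound ceiling: ~2 FLOPs (multiply + add) per weight per token."""
    flops_per_token = 2 * active_params
    return flops_per_s / flops_per_token


if __name__ == "__main__":
    ACTIVE_PARAMS = 10e9   # hypothetical active parameters per token
    BITS_PER_WEIGHT = 2.0  # hypothetical 2-bit quantization
    MEM_BW = 400e9         # hypothetical 400 GB/s memory bandwidth
    COMPUTE = 20e12        # hypothetical 20 TFLOP/s usable compute

    gen = decode_tokens_per_s(ACTIVE_PARAMS, BITS_PER_WEIGHT, MEM_BW)
    pre = prefill_tokens_per_s(ACTIVE_PARAMS, COMPUTE)
    print(f"generation ceiling: {gen:.0f} tok/s")
    print(f"prefill ceiling:    {pre:.0f} tok/s")
```

Prefill can batch many prompt tokens against a single pass over the weights, so it runs into the compute ceiling; generation produces one token per pass, so it runs into the bandwidth ceiling instead.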