@PavloMolchanov The inference engine will be critical infrastructure for many-in-one nested models. I am working on pruning-based KV cache compression integrated into an inference engine for local AI.
I went through something similar. Last month, I was looking into KV cache compression and the dLLM tech stack, and I noticed that the KV cache uses a huge amount of memory, often wastefully, and that there is a trade-off between memory and compute. I posted about it on social media, but people just laughed at me. 😂
@ParadisLabs Actually, we have compressed KV cache storage by a factor of thousands through a new inference engine architecture. We still have more testing to do before we can release it. The CXL solution works at the hardware level but is expensive; our solution works at the algorithm level and is cheap.