@pip_net That's all correct. BUT they have always been terrible at product. Besides basic chat AI and some TPU rental, I have a very hard time believing they will really win anything here.
KV cache shouldn't disappear every time vLLM restarts. With @novita_labs, we're sharing PegaFlow, a production-grade external KV cache service that plugs into vLLM through the external KV connector interface.
PegaFlow runs as a standalone Rust daemon owning the host KV pool, SSD cache, and RDMA resources. vLLM workers attach via CUDA IPC + gRPC, and cache survives engine crashes, upgrades, and model switches.
In production-oriented evaluations:
- 2.15× faster vLLM startup with a pre-warmed 500 GiB host pool
- 56% higher throughput for 8 Qwen3-8B instances sharing one cache
- 72% higher throughput for DeepSeek-V3.2 MLA TP8 (logical KV stored once, not per rank)
- 194 GB/s average remote-read throughput across nodes
Three-level hierarchy: pinned DRAM, remote DRAM over RDMA, local SSD on io_uring. Integrates through the existing `kv_transfer_config` path, with no vLLM source changes.
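For a rough idea of what wiring a connector through `kv_transfer_config` could look like, here is a minimal sketch that builds a `vllm serve` command line. The connector name `PegaFlowConnector` and the module path `pegaflow.connector` are hypothetical placeholders, not PegaFlow's real identifiers, and the model ID is only an example; check the project's own docs for the actual values.

```python
import json
import shlex

# Connector settings passed to vLLM as JSON.
# NOTE: "PegaFlowConnector" and "pegaflow.connector" are hypothetical
# placeholders used for illustration only.
kv_transfer_config = {
    "kv_connector": "PegaFlowConnector",               # hypothetical name
    "kv_connector_module_path": "pegaflow.connector",  # hypothetical path
    "kv_role": "kv_both",                              # worker both saves and loads KV
}


def build_serve_command(model: str, config: dict) -> str:
    """Return a shell-safe `vllm serve` command carrying the config as JSON."""
    args = ["vllm", "serve", model, "--kv-transfer-config", json.dumps(config)]
    return shlex.join(args)


# Example model ID; any model served by vLLM would be passed the same way.
command = build_serve_command("Qwen/Qwen3-8B", kv_transfer_config)
print(command)
```

The point of the sketch is only that the connector is selected by configuration at launch time, which is why no vLLM source changes are needed.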
https://t.co/rf2VmevP7J
AMD launches MI350P, its first PCIe "Instinct" in four years: packs CDNA 4 GPU with 4.6 PFLOPs AI compute, 144 GB HBM3E at 600W. https://t.co/uLAh7eokph
@akmalnasir @EkonomiMalaysia There is also huge complexity in the software layer needed to operate data centers, especially when it comes to AI. That would require local operators, not foreign hyperscalers, to own the data centers, though.
What workstation / home server hardware do I need to serve Qwen 3.6 27b / Gemma 4 31b with a 32k context window for 20 concurrent generations of ~150 output tokens each at 10s latency?
CC @DrFriesOfficial @jun_song @__tinygrad__ @sudoingX
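As a back-of-envelope framing of the question above, the stated numbers translate into a throughput floor and a worst-case KV footprint in tokens. This is only a sketch: it assumes "32k" means 32 * 1024 tokens, that every request uses the full window, and that prefill time is ignored, so the real decode rate would need to be higher.

```python
# Numbers taken from the question; 32k is assumed to mean 32 * 1024 tokens.
concurrent_requests = 20
output_tokens = 150
latency_budget_s = 10
context_window = 32 * 1024

# Minimum decode speed per request if the whole budget went to decoding.
per_stream_tok_s = output_tokens / latency_budget_s

# Aggregate decode throughput the server must sustain across all requests.
aggregate_tok_s = concurrent_requests * per_stream_tok_s

# Worst case: every request holds a full context window in the KV cache.
worst_case_kv_tokens = concurrent_requests * context_window

print(f"per-stream floor: {per_stream_tok_s} tok/s")
print(f"aggregate floor:  {aggregate_tok_s} tok/s")
print(f"worst-case KV:    {worst_case_kv_tokens} tokens")
```

Turning the KV token count into gigabytes needs each model's layer count, KV head count, head dimension, and cache dtype, which are not given here.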
@brandonjcarl Token prices have stayed constant during this time, even though token production is 50x cheaper on new systems like NVL72. These savings will eventually be passed on to consumers once sufficient compute supply is available.