@rileybrown you can probably build it for half of that if you want to be creative and get something like this, but I'd say get the 256GB PMem sticks and switch to a dual-socket board, as the CPUs are capped at 1TB of RAM https://t.co/AKMKLmP8Qa
Somebody just ran a one-trillion-parameter model (Kimi K2.5) at over 4 tokens/sec on a single RTX 3060 12GB GPU plus 768GB of second-hand Intel Optane memory.
What happened is that a sparse model met an unusual memory tier that could hold its enormous body while the GPU handled the most time-sensitive organs.
In other words, the bulk of the sparse expert weights live in a larger, cheaper memory tier and are pulled into the computation as needed.
This worked because Kimi K2.5 is a Mixture-of-Experts model, so it has 1T total parameters but activates only 32B per token.
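A quick back-of-envelope sketch of why the sparsity matters. The parameter counts are the ones quoted above; the bits-per-weight value is purely an assumption for illustration, since the post does not say which quantization was used.

```python
# Back-of-envelope numbers for a sparse Mixture-of-Experts model.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total parameters (from the post)
ACTIVE_PARAMS = 32_000_000_000     # 32B activated per token (from the post)

# ASSUMPTION: placeholder quantization level; the post does not state it.
ASSUMED_BITS_PER_WEIGHT = 4.5


def active_fraction(active: int, total: int) -> float:
    """Share of the weights that a single token actually touches."""
    return active / total


def weights_size_gb(params: int, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
total_gb = weights_size_gb(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
active_gb = weights_size_gb(ACTIVE_PARAMS, ASSUMED_BITS_PER_WEIGHT)

print(f"active fraction per token: {frac:.1%}")
print(f"all weights:    ~{total_gb:.0f} GB")
print(f"active weights: ~{active_gb:.0f} GB per token")
```

Only about 3% of the weights are touched per token, so the full model mostly needs somewhere big to sit, not somewhere fast.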
The RTX 3060’s 12GB VRAM holds latency-sensitive parts like routing, attention, dense layers, and shared experts.
The huge expert weights sit in Optane PMem, configured as RAM, while 192GB DDR4 ECC acts as cache.
The builder used six 128GB Optane PMem (DCPMM) sticks, 768GB in total. This retired memory format was made to bridge DRAM and SSD performance: it beats the best NVMe SSDs on latency by a wide margin, but remains 2x to 3x slower than DRAM.
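A toy model of what the DDR4 cache buys. It treats DRAM access cost as 1.0 and Optane as 2x to 3x that (the range quoted above), then averages by cache hit rate. The hit rates are made-up inputs, not measurements from this build, and real behavior depends on access patterns this sketch ignores.

```python
PMEM_STICKS = 6
PMEM_STICK_GB = 128
PMEM_TOTAL_GB = PMEM_STICKS * PMEM_STICK_GB  # 768 GB pool


def effective_slowdown(dram_hit_rate: float, pmem_factor: float = 2.5) -> float:
    """Average access cost relative to DRAM (DRAM = 1.0).

    pmem_factor is how many times slower Optane is than DRAM;
    2.5 is just the midpoint of the 2x-3x range from the post.
    """
    if not 0.0 <= dram_hit_rate <= 1.0:
        raise ValueError("dram_hit_rate must be between 0 and 1")
    return dram_hit_rate * 1.0 + (1.0 - dram_hit_rate) * pmem_factor


print(f"Optane pool: {PMEM_TOTAL_GB} GB")
# Hypothetical hit rates, for illustration only.
for hit_rate in (0.0, 0.5, 0.9, 1.0):
    print(f"hit rate {hit_rate:.0%}: {effective_slowdown(hit_rate):.2f}x DRAM cost")
```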
llama.cpp handled hybrid GPU/CPU inference, with tensor placement tuned through flags like --override-tensor.
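For a sense of what that placement looks like, here is a sketch that assembles a llama.cpp command line. The exact flags and values used in this build were not shared, so the model filename, the layer count, and the regex are all assumptions; the pattern shown is a common way to keep MoE expert tensors in system memory while everything else is offloaded to the GPU.

```python
import shlex


def build_llama_command(
    model_path: str,
    # ASSUMPTION: regex matching MoE expert tensors, pinned to CPU memory.
    override_pattern: str = r"\.ffn_.*_exps\.=CPU",
    # ASSUMPTION: a large value so all remaining layers go to the GPU.
    n_gpu_layers: int = 99,
) -> list[str]:
    """Assemble an argument list for llama.cpp's llama-cli binary."""
    return [
        "llama-cli",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),
        "--override-tensor", override_pattern,
    ]


# Hypothetical model filename, for illustration only.
cmd = build_llama_command("kimi-k2.5.gguf")
print(shlex.join(cmd))
```

The override is a regex on tensor names followed by a target buffer, so the split between GPU and system memory can be adjusted without touching the model file.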
The result was roughly 4 tokens/sec, which is slow for chat but impressive for a local 1T-parameter model on cheap retired enterprise hardware.
The DDR4 acted as cache, the Optane acted as a giant memory pool, and llama.cpp pushed routing and other critical tensors onto the 12GB GPU.
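As a rough sanity check on the roughly 4 tokens/sec figure, this sketch estimates how much weight data would have to be read per second if every active parameter were streamed once per token. The bits-per-weight value is an assumption (the quantization was not stated), and the result overstates the load on Optane because part of the active weights sit in VRAM or the DDR4 cache.

```python
def implied_read_rate_gb_s(
    tokens_per_sec: float,
    active_params: int,
    bits_per_weight: float,
) -> float:
    """GB/s of weight reads if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return tokens_per_sec * bytes_per_token / 1e9


# 4 tokens/sec and 32B active params are from the post;
# 4.5 bits per weight is an ASSUMED placeholder.
rate = implied_read_rate_gb_s(4.0, 32_000_000_000, 4.5)
print(f"implied weight traffic: ~{rate:.0f} GB/s across all memory tiers")
```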