@banteg That's about right, I get ~24 tps with kimi-k2. Usable for just me; I have to manage context caching really well, and researching a codebase can take some time since prefill is at ~45 tps. I still use Claude, but maximum sovereignty forced my hand to have this at home : ). Cost ~22k.
Ha thanks! Self sovereign maxy : )... had to find a way to own the intelligence at home lol. I think most don't realize you can do this with large MoEs. You can keep the router + attention layers on the GPUs and offload the expert weights to CPU/system RAM. Then maximize CPU memory bandwidth with as many channels as possible (12 channels here, fully populated). Not many boards support 12 channels yet though. So it was either something like this or 200k+ to run them.
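A rough sketch of that placement rule, in case it helps picture it: routed-expert weights go to system RAM, everything else (attention, router, shared layers) stays on the GPUs. The tensor names below follow the style llama.cpp uses for GGUF MoE models, but treat them, the regex, and the `place` helper as illustrative assumptions and not the actual setup used here.

```python
import re

# Assumption: routed-expert tensors carry an "_exps" suffix in their name,
# as in GGUF-style names such as "blk.3.ffn_up_exps.weight".
EXPERT_PATTERN = re.compile(r"\.ffn_(up|down|gate)_exps\.")


def place(tensor_name: str) -> str:
    """Return 'cpu' for routed-expert weights, 'gpu' for everything else."""
    return "cpu" if EXPERT_PATTERN.search(tensor_name) else "gpu"


# Hypothetical tensor names for one transformer block.
names = [
    "blk.3.attn_q.weight",        # attention projection
    "blk.3.ffn_gate_inp.weight",  # router (small, used on every token)
    "blk.3.ffn_up_exps.weight",   # routed experts (bulk of the parameters)
    "blk.3.ffn_down_exps.weight",
]

for name in names:
    print(f"{place(name):>3}  {name}")
```

The idea is that the small, always-used tensors stay in fast VRAM, while the huge but sparsely used expert tensors sit in RAM, where only the few experts picked per token get read.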
Prompt processing is the main issue. Continuous sessions with good caching work at usable speed... but doing things like mid-session context compression, like opencode does by default, isn't worth it.
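Back-of-the-envelope numbers for why caching matters so much at this prefill speed. This is a minimal sketch: the ~45 tps figure is the one quoted earlier in the thread, while the context sizes and the `prefill_seconds` helper are made up for illustration. It assumes a simple prefix cache, where only tokens not already cached need processing.

```python
def prefill_seconds(prompt_tokens: int, cached_tokens: int, prefill_tps: float) -> float:
    """Seconds to process the part of the prompt not already in the KV cache."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / prefill_tps


PREFILL_TPS = 45.0  # prefill speed quoted in the thread
CONTEXT = 60_000    # hypothetical session size in tokens

# Cold start (or after the prefix was rewritten): everything is reprocessed.
cold = prefill_seconds(CONTEXT, 0, PREFILL_TPS)
# Warm continuation: only the newest 2,000 tokens are uncached.
warm = prefill_seconds(CONTEXT, 58_000, PREFILL_TPS)

print(f"cold: {cold / 60:.1f} min, warm: {warm:.1f} s")
```

Under these assumptions a cold 60k-token prompt takes over 20 minutes, while a warm continuation takes well under a minute. Anything that rewrites the start of the context, like compression, would likely push you back to the cold case.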
Lol well the above are more important. I'm not one for PC aesthetics, but I had to modify the case for the server board and waterblock the CPU and GPUs to get it all to fit. Wanted to make sure I could run the largest LLMs into the future, even if they're a bit slow but at usable speed.
@Marslauncher@mattsilv@UnslothAI@MiniMax_AI I run 2x 5090s + EPYC with 768GB of RAM across all 12 channels, and run kimi-k2 at Q4 (~full precision) at ~23 t/s. Thinking I may get 40 to 60 running M3 at Q8 with its new structure. Will test when the PR lands in llama.cpp.
@xcoldplunge 100%. I'm 98% ETH, not a small amount... I'm there with you! Front lines fighting for real digital freedoms; without Ethereum the world outlook is bleak. Don't think most people understand what a ubiquitous Ethereum network looks like, where not everything you build is financial. ZK!
@MicahZoltu@LefterisJP Correction: 162 tok/s single-stream with MTP spec decode on. 2x 5090 TP=2, vLLM 0.20. FP8 weights + FP8 KV cache, MTP spec decoding for Qwen2.6 27b... Maybe it was the 35B-A3 where I was hitting above 200 tps. Would have to benchmark for that again.
@MicahZoltu@LefterisJP Running an EPYC 9475F (Turin, 48c) on a Gigabyte MZ73-LM2 with 12x 64GB DDR5-6400 RDIMM (768GB, all 12 channels filled on socket 0). For Kimi I'm on a 4-bit GGUF in llama.cpp with 4-bit KV cache, getting close to full performance since Kimi was natively trained on INT4.
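For a rough sanity check on the decode speed, here is the theoretical memory-bandwidth ceiling for this RAM config. Assumptions, labeled loudly: ~32B active parameters per token for Kimi K2 (the publicly stated figure), a flat 4 bits per weight, and every active weight streamed from system RAM. In practice attention and the router sit on the GPUs, and sustained bandwidth lands below the theoretical peak, so this is only an order-of-magnitude bound.

```python
CHANNELS = 12             # all 12 channels populated
TRANSFERS_PER_S = 6400e6  # DDR5-6400: 6400 mega-transfers per second
BYTES_PER_TRANSFER = 8    # 64-bit data path per channel

# Theoretical peak bandwidth in bytes per second.
peak_bw = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER

ACTIVE_PARAMS = 32e9      # assumption: ~32B active params per token
BITS_PER_WEIGHT = 4       # assumption: flat 4-bit weights, no overhead

# Bytes that must be read from memory to generate one token.
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8

# Upper bound on decode speed if memory bandwidth is the only limit.
ceiling_tps = peak_bw / bytes_per_token

print(f"peak bandwidth: {peak_bw / 1e9:.1f} GB/s")
print(f"decode ceiling: {ceiling_tps:.1f} tok/s")
```

Under these assumptions the peak works out to 614.4 GB/s and the ceiling to about 38 tok/s, so the observed ~23 to 24 t/s sits plausibly below the theoretical limit.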
@dcinvestor I think every utterance of language is fallible to the perceived logic of the next. I don't think we can remove that property and still see benefits from something trained on language. However, maybe it just debates/gaslights us into submission, like an abusive relationship.