At 3,000 tokens/s, your budget per token is 333 microseconds. A basic kernel launch is 4.5µs. Kog hit 3k tok/s on standard GPUs by dropping PyTorch and writing a persistent monokernel in raw assembly. The CPU scheduler is officially the enemy ✨
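A rough sketch of the arithmetic behind that claim, assuming the two figures quoted above (3,000 tokens/s and 4.5 µs per launch). The function names are made up for illustration, and the source does not say how many kernels a typical stack launches per token, so that number is left as an input.

```python
# Back-of-the-envelope launch-overhead budget.
# The only inputs taken from the post are 3,000 tok/s and 4.5 µs per launch;
# all names below are hypothetical helpers, not part of any library.

TOKENS_PER_SECOND = 3_000
LAUNCH_OVERHEAD_US = 4.5


def per_token_budget_us(tokens_per_second: float) -> float:
    """Wall-clock time available per token, in microseconds."""
    return 1_000_000 / tokens_per_second


def launches_that_fit(budget_us: float, launch_overhead_us: float) -> int:
    """How many launches consume the whole budget before any compute runs."""
    return int(budget_us // launch_overhead_us)


def launch_overhead_fraction(n_launches: int) -> float:
    """Share of the per-token budget spent on launch overhead alone."""
    return n_launches * LAUNCH_OVERHEAD_US / per_token_budget_us(TOKENS_PER_SECOND)


budget = per_token_budget_us(TOKENS_PER_SECOND)
max_launches = launches_that_fit(budget, LAUNCH_OVERHEAD_US)

if __name__ == "__main__":
    print(f"budget per token: {budget:.1f} µs")
    print(f"launches that exhaust the budget: {max_launches}")
```

Under these numbers, launch overhead alone uses up the per-token budget after a few dozen launches, which is the case for a single persistent kernel that is launched once and stays resident.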
@mr_r0b0t @Snixtp GB10 won't give you the best results, I guess. It doesn't have optimized NVFP4 tensor cores, right? For that mission, though, you deserve a GB300.