@thdxr @meansoabstractn Lol just completely wrong. DeepSeek inference margins are public. And they don’t even have NVL72s that literally give you 10x gains in inference throughput.
@ben_j_todd In what world is inference compute 50% unless you count free users?
Your B200 numbers are way off too, since SemiAnalysis showed DS R1 can get >9k TPS/GPU with NVL72 + PD disagg.
@zephyr_z9 @teortaxesTex If you’ve been to KR/JP it’s very obvious they were never going to win just based on culture
Same reason NYC hasn’t produced any notable work in this space
SF still has enough non-grifters and risk takers
@teortaxesTex Better hardware = stack more SRAM
Already been tried (Cerebras, Groq, SambaNova)
Diffusion doesn’t improve this if you decode 1 token at a time
@yacinelearning It’s a big hack
The reason GRPO-like objectives are unstable with MoE is that a single gradient step can cause you to greatly exceed the clip bound due to rerouting, and then you won’t update to fix it bc you’re clipped. The right way to fix it is to stop clipping the update itself, like CISPO does.
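A rough sketch of the point about clipping, for a single token. It compares the gradient coefficient on the token's log-prob under a PPO/GRPO-style clipped objective with a CISPO-style one, where the clipped importance ratio is treated as a constant weight. The function names, the epsilon of 0.2, and the ratio of 3.0 (standing in for a rerouting-induced jump) are all assumptions for illustration, not values from any particular training run.

```python
def clip(x, lo, hi):
    return max(lo, min(x, hi))

def ppo_grad_coeff(ratio, adv, eps=0.2):
    """d/d(logp) of min(r*A, clip(r)*A), where r = exp(logp - logp_old)."""
    unclipped = ratio * adv
    clipped = clip(ratio, 1 - eps, 1 + eps) * adv
    # Gradient only flows when the unclipped branch is the active minimum;
    # the clipped branch is constant in logp, so its gradient is zero.
    return ratio * adv if unclipped <= clipped else 0.0

def cispo_grad_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    """d/d(logp) of stop_grad(clip(r)) * A * logp (CISPO-style weighting)."""
    # The ratio is clipped but detached, so the token keeps a bounded,
    # nonzero gradient instead of dropping out of the update.
    return clip(ratio, 1 - eps_low, 1 + eps_high) * adv

# Hypothetical token whose ratio jumped far past the bound after one step.
ratio, adv = 3.0, 1.0
print("PPO/GRPO-style coeff:", ppo_grad_coeff(ratio, adv))
print("CISPO-style coeff:   ", cispo_grad_coeff(ratio, adv))
```

Under these assumptions the clipped objective gives that token zero gradient once its ratio is past the bound, while the CISPO-style weighting still gives it a bounded one.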