buyside tech but i discuss whatever tf i want. best golf score: 84 (twice). this sht ain’t nothing to me man. radius/spybar front right. only ideas not advice
@saso_capital are you retarded? this literally makes no sense. AECs are for scale up. “$AAOI’s exposure is concentrated in exactly the undifferentiated short-reach laser module” is completely false. the datasheet for $aaoi 800G transceiver is below where you can clearly see distance = 500m
been saying since $30B ARR anthropic is likely EBIT positive. where the fuck is that delusional Ed Zitron guy and his gang of retards now? don’t even have to go off an EBIT number
Grok got my back:
Yes, your speculation is directionally correct, and a sharp intuition, not “retarded” at all.
Here’s why it holds up in practice for serving (inference) large MoE models:
•Sparsity/MoE fundamentally decouples total model size from per-token active compute and memory traffic. A model like DeepSeek-V3 has ~671B total parameters but activates only ~37B per token. The router picks a small subset of experts (typically top-2 or top-8 out of dozens/hundreds). This means the FLOPs and the weights that actually need to be loaded from HBM per forward pass are a small fraction of the total model.
•Expert Parallelism (EP) + high interconnect bandwidth lets you distribute those experts across many GPUs/nodes. Each GPU only holds and serves its own slice of experts. When a token routes to an expert on another GPU, you do an all-to-all exchange. Larger interconnect bandwidth (NVLink domains, NVSwitch, high-speed fabrics, etc.) makes this exchange cheap and scalable. NVIDIA’s own Wide Expert Parallelism on GB200 NVL72 racks explicitly uses the massive coherent 130 TB/s aggregate NVLink bandwidth to make wide EP practical without blowing up communication overhead.
•Result: total model size can grow much faster than per-GPU HBM bandwidth demand. You add more experts → larger model. You spread them with EP over proportionally more GPUs → each GPU’s local expert weights stay roughly the same size. HBM bandwidth demand per GPU therefore stays roughly constant (it’s still dominated by loading its own active experts + activations), not scaling linearly with the overall model size. The extra scale is “paid for” by the interconnect and by the sparsity itself.
This is exactly how frontier MoE serving works today (Mixtral, DeepSeek, Grok-1 style models, etc.). Without sparsity you’d be forced into either (a) much bigger individual GPUs with proportionally higher HBM bandwidth or (b) heavier tensor parallelism that also hammers interconnect. Sparsity + EP + fat interconnect is the trick that breaks the linear scaling.
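To put rough numbers on that, here is a back-of-envelope sketch in Python. The 671B/37B figures are the DeepSeek-V3 numbers quoted above; everything else (the function names, 1 byte per parameter, treating all weights as evenly shardable experts, the GPU counts) is an illustrative assumption, not a description of any real deployment.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights that a single token actually touches."""
    return active_params / total_params


def per_gpu_weights_gb(total_params: float, num_gpus: int,
                       bytes_per_param: float = 1.0) -> float:
    """Weights resident on each GPU if parameters are spread evenly.

    Simplifying assumption: every parameter is an expert weight that can be
    sharded with expert parallelism (real models also have dense layers).
    """
    return total_params * bytes_per_param / num_gpus / 1e9


total, active = 671e9, 37e9  # figures quoted above for DeepSeek-V3
print(f"active fraction per token: {active_fraction(total, active):.1%}")

# Hypothetical scaling step: double the expert pool AND the GPU count.
base = per_gpu_weights_gb(total, num_gpus=64)
scaled = per_gpu_weights_gb(2 * total, num_gpus=128)
print(f"per-GPU weights: {base:.2f} GB -> {scaled:.2f} GB")
```

Under these assumptions the model doubles while each GPU's resident weights stay put, which is the "roughly constant per-GPU HBM demand" point. What grows instead is the number of GPUs that tokens have to be routed across.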
A couple of real-world caveats (they don’t invalidate the direction)
•Memory capacity still scales with total parameters (all experts have to live somewhere in the cluster), but that’s aggregate cluster memory, not per-GPU HBM pressure.
•Communication can still bite if your interconnect isn’t up to the task, especially in multi-node setups or at huge batch sizes. All-to-all routing traffic is real (Meta has seen it eat 10-30% of latency). But that’s precisely why more interconnect bandwidth is valuable, not a “tax on stupidity.”
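The communication caveat can be sized the same way. This is only a sketch: it assumes each routed token sends its hidden vector to every selected expert and gets one back (dispatch + combine), that all selected experts are remote, and it uses made-up hidden size, batch size and link bandwidths rather than specs for any particular model or hardware.

```python
def all_to_all_bytes_per_token(hidden_size: int, top_k: int,
                               bytes_per_elem: int = 2,
                               remote_fraction: float = 1.0) -> float:
    """Bytes moved per token per MoE layer for dispatch plus combine.

    remote_fraction is the (assumed) share of selected experts that live on
    another GPU; 1.0 is the worst case.
    """
    return 2 * top_k * hidden_size * bytes_per_elem * remote_fraction


def transfer_time_us(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Idealized transfer time in microseconds (ignores latency and contention)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e6


# Illustrative values only: hidden size 7168, top-8 routing, 16-bit activations.
per_token = all_to_all_bytes_per_token(hidden_size=7168, top_k=8)
batch_bytes = per_token * 4096  # hypothetical batch of 4096 tokens

for name, bw in [("slow fabric", 50), ("fast fabric", 900)]:  # GB/s, assumed
    print(f"{name}: {transfer_time_us(batch_bytes, bw):.0f} us per MoE layer")
```

The traffic per layer is fixed by the routing, so the only lever in this toy model is the bandwidth term, which is why a fatter interconnect shows up directly as lower per-layer latency.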
So yeah—you nailed the core insight. The bubble-boi take that extra inter-cluster bandwidth is pointless ignores exactly this dynamic. Sparsity and MoE don’t eliminate bandwidth bottlenecks, but they do let you serve dramatically larger models without a matching linear explosion in per-GPU memory-bandwidth demands. That’s how the industry is actually scaling inference right now.
@bubbleboi thru sparsity tho you can serve a larger model (enabled by larger interconnect) without some equal (linear) increase in memory bandwidth i would imagine? Or am i just retarded
@bubbleboi unless the argument is we should just do more sparsity & RL on the current parameter sized models and scaling laws are dead i don’t really get the tweet man