puffin

@puffinancial

buyside tech but i discuss whatever tf i want. best golf score: 84 (twice). this sht ain’t nothing to me man. radius/spybar front right. only ideas not advice

Chicago, IL

Joined June 2022

1K Following

615 Followers

605 Posts

puffin

@puffinancial

2 days ago

@BillKrackman +money max(EV) hedge is (ticket profit)/((odds/100)+1): (60-5)/((170/100)+1) = 55/2.7 = 20.37k
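The arithmetic in that reply can be sketched in a few lines of Python. This is only an illustration of the formula as posted, not betting advice. The function name `hedge_stake` is hypothetical, and reading 60 as the ticket payout and 5 as the original stake (both in thousands) is an assumption the post does not spell out.

```python
def hedge_stake(ticket_profit: float, american_odds: float) -> float:
    """Hedge size per the posted formula: ticket profit / ((odds / 100) + 1).

    Only positive ("+money") American odds are handled, matching the post.
    """
    if american_odds <= 0:
        raise ValueError("this sketch only covers +money (positive) odds")
    # (odds / 100) + 1 converts +money American odds to decimal odds.
    decimal_odds = american_odds / 100 + 1
    return ticket_profit / decimal_odds


# Numbers from the post; treating them as thousands is an assumption.
stake = hedge_stake(ticket_profit=60 - 5, american_odds=170)
print(f"hedge stake: {stake:.2f}k")
```

With the post's inputs this divides 55 by 2.7, which agrees with the 20.37k figure quoted.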

puffin

@puffinancial

5 days ago

@KawzInvests the div is suspended FYI and your anthropic % is too low

254

puffin

@puffinancial

6 days ago

$msgs

puffin

@puffinancial

10 days ago

6/7

14 days ago

@saso_capital are you retarded? this literally makes no sense. AECs are for scale up. “$AAOI’s exposure is concentrated in exactly the undifferentiated short-reach laser module” is completely false. the datasheet for $aaoi 800G transceiver is below where you can clearly see distance = 500m

puffin

@puffinancial

15 days ago

Sundar receives an A+ for the equity raise, but $goog and da 7 must go down

241

puffin

@puffinancial

27 days ago

ok us gov giving $2B to quantum … $intc is right there, this is like 5 high-NA machines we are lighting on fire

278

puffin

@puffinancial

27 days ago

been saying since $30B ARR anthropic is likely EBIT positive. where the fuck is that delusional Ed Zitron guy and his gang of retards now? don’t even have to go off an EBTIT number

877

puffin

@puffinancial

29 days ago

@bubbleboi $ is worthless SPY to 1k

puffin

@puffinancial

29 days ago

fuckkkk just got tapped to degross my semis because some autist wasn’t holding my book as of 3/31 fuckkkk

115

puffin

@puffinancial

29 days ago

unfazed by the semis puke today but did buy a thing of zyns and proceed to immediately drop half of them #itsallover

164

puffin

@puffinancial

about 1 month ago

EBTIT (earnings before training, interest & taxes)

149

puffin

@puffinancial

about 1 month ago

Grok got my back: Yes, your speculation is directionally correct, and a sharp intuition, not “retarded” at all. Here’s why it holds up in practice for serving (inference) large MoE models:

• Sparsity/MoE fundamentally decouples total model size from per-token active compute and memory traffic. A model like DeepSeek-V3 has ~671B total parameters but activates only ~37B per token. The router picks a small subset of experts (typically top-2 or top-8 out of dozens/hundreds). This means the FLOPs and the weights that actually need to be loaded from HBM per forward pass are a small fraction of the total model.

• Expert Parallelism (EP) + high interconnect bandwidth lets you distribute those experts across many GPUs/nodes. Each GPU only holds and serves its own slice of experts. When a token routes to an expert on another GPU, you do an all-to-all exchange. Larger interconnect bandwidth (NVLink domains, NVSwitch, high-speed fabrics, etc.) makes this exchange cheap and scalable. NVIDIA’s own Wide Expert Parallelism on GB200 NVL72 racks explicitly uses the massive coherent 130 TB/s aggregate NVLink bandwidth to make wide EP practical without blowing up communication overhead.

• Result: total model size can grow super-linearly relative to per-GPU HBM bandwidth. You add more experts → larger model. You spread them with EP → each GPU’s local expert weights stay roughly the same size. HBM bandwidth demand per GPU therefore stays roughly constant (it’s still dominated by loading its own active experts + activations), not scaling linearly with the overall model size. The extra scale is “paid for” by the interconnect and by the sparsity itself.

This is exactly how frontier MoE serving works today (Mixtral, DeepSeek, Grok-1 style models, etc.). Without sparsity you’d be forced into either (a) much bigger individual GPUs with proportionally higher HBM bandwidth or (b) heavier tensor parallelism that also hammers interconnect. Sparsity + EP + fat interconnect is the trick that breaks the linear scaling.

A couple of real-world caveats (they don’t invalidate the direction):

• Memory capacity still scales with total parameters (all experts have to live somewhere in the cluster), but that’s aggregate cluster memory, not per-GPU HBM pressure.

• Communication can still bite if your interconnect isn’t up to the task, especially in multi-node setups or at huge batch sizes. All-to-all routing traffic is real (Meta has seen it eat 10-30% of latency). But that’s precisely why more interconnect bandwidth is valuable, not a “tax on stupidity.”

So yeah, you nailed the core insight. The bubble-boi take that extra inter-cluster bandwidth is pointless ignores exactly this dynamic. Sparsity and MoE don’t eliminate bandwidth bottlenecks, but they do let you serve dramatically larger models without a matching linear explosion in per-GPU memory-bandwidth demands. That’s how the industry is actually scaling inference right now.
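A rough back-of-the-envelope version of the scaling argument above, as a hedged Python sketch. The helper names (`active_fraction`, `per_gpu_expert_params`), the expert counts, and the per-expert size are all hypothetical, and the model is deliberately crude: it ignores attention and shared weights, activations, and load imbalance, and assumes experts are spread evenly across the expert-parallel group.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of total parameters touched per token in a sparse MoE model."""
    return active_params_b / total_params_b


def per_gpu_expert_params(num_experts: int, params_per_expert_b: float,
                          ep_degree: int) -> float:
    """Expert parameters (in billions) resident on each GPU, assuming
    experts are sharded evenly across `ep_degree` GPUs."""
    return num_experts * params_per_expert_b / ep_degree


# Figures quoted in the post: ~671B total, ~37B active per token.
print(f"active share per token: {active_fraction(671, 37):.1%}")

# Hypothetical configs: doubling the expert count while doubling the
# expert-parallel degree leaves each GPU's resident expert weights unchanged.
small = per_gpu_expert_params(num_experts=256, params_per_expert_b=2.5, ep_degree=64)
large = per_gpu_expert_params(num_experts=512, params_per_expert_b=2.5, ep_degree=128)
print(small, large)
```

In this toy model the second configuration holds twice the expert parameters in aggregate, while the per-GPU share is the same in both; the cost of the wider group shows up as all-to-all traffic over the interconnect, which the sketch does not model.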

164

puffin

@puffinancial

about 1 month ago

@bubbleboi thru sparsity tho you can serve a larger model (enabled by larger interconnect) without some equal (linear) increase in memory bandwidth i would imagine? Or am i just retarded

167

puffin

@puffinancial

about 1 month ago

@bubbleboi unless the argument is we should just do more sparsity & RL on the current parameter sized models and scaling laws are dead i don’t really get the tweet man