@0xsmac Anyone who actually understands the economics of being an inference provider knows that the narrative is BS. Nothing against the team (they have legit builders) but it's definitely not some sort of uniquely crypto-enabled killer AI use case.
@sjwhitmore You don't really need to wake up your baby every 2-3 hrs to feed after the first couple weeks. As long as they're back to birth weight and growing, you can let them sleep as long as they want at night and feed only when they wake up hungry.
Goldman Delta One head, Rich Privorotsky, on tokenomics
"Token economics: Reading that DeepSeek reportedly cut token pricing by 75% and Xiaomi’s MiMo by almost 99% immediately brought back memories of the old Groupon subsidy wars and the inevitable race-to-the-bottom economics of commoditized delivery. There’s also been a massive rise in open-source enthusiasm. I was honestly blown away running an 8B version of Qwen locally on a four-year-old MacBook last night (ok it couldn't do much but it felt like downloading the internet in 5GB... 18 months ago you would have needed a data center for this!). Notably, Chinese onshore datacenter and AI infrastructure names have diverged sharply post-release (they all went down). Maybe a bit of a leap here but I think the market is beginning to ask whether token cost compression temporarily breaks the logic of pure Jevons paradox demand expansion. It's not whether demand ultimately rises… it probably does… but whether there is a meaningful lag where cheaper tokens simply cannibalize higher cost inference before entirely new use cases emerge. Nobody is arguing open source models are fully comparable to frontier systems, although the quality gap is clearly narrowing quickly. The more important point is that a huge percentage of enterprise tasks simply do not require frontier level reasoning or expensive inference. That becomes a major boardroom conversation into Q2/Q3. Rationalization of token spend may become just as important as the AI growth narrative itself, particularly when “90% of the output for 10% of the cost” becomes increasingly viable through open source alternatives."
@factorydoge69 This OG image is one of my favorite memes ever. Spent 3 years spinning my wheels with SS so all I could do was laugh at how true this picture was by the end.
@bindureddy @abacusai Are you using self-hosted open source models? Or an inference API endpoint? We can help you guys reduce costs and squeeze out more tokens/sec
Much of Dwarkesh's argument hinges on this statement, which *was* accurate but will be increasingly inaccurate on a go-forward basis imo:
“American labs port across accelerators constantly. Anthropic's models are run on GPUs, they're run on Trainium, they're run on TPUs. There are so many things you can do, from distilling to a model that's well fit for your chips.”
As system-level architectures diverge (torus vs. switched scale-up topologies, memory hierarchies, networking primitives), true portability is eroding. The MI300 and MI325 had roughly the same scale-up domain size as Hopper while Blackwell’s scale-up domain is 9x larger than the MI355 scale-up domain, etc.
Many frontier models are now being explicitly co-designed for inference on specific hardware like GB300 racks. Codex on Cerebras is another example. Those models run less efficiently on other systems and the performance differentials will only widen. A model that runs well on Google’s torus topology will run less efficiently on Nvidia’s switched scale-up topology and vice versa - the data traffic is fundamentally different as a byproduct of the models being parallelized across the different topologies.
Google’s internal teams - and increasingly the Anthropic teams as they become the most important customer of almost every cloud - have the luxury of operating across the stack (models, chips, networking) - but that is not the case for the rest of the market and other prospective users. Anthropic is the exception, not the rule. To wit, Anthropic and Google allegedly have a mutual understanding where Anthropic can hire the TPU engineers they need every year to ensure that they can continue to get the most out of the TPU.
Given the overwhelming importance of cost per token to the economics of the labs, models will be run where they run best. Most extremely large MoE models will run best on GB300s given the importance of having a switched scale-up network like NVLink for MoE inference. When training was the dominant cost for labs and power was broadly available, labs were optimizing to minimize capex dollars. Model portability was a way to create leverage over suppliers. I think that drove a lot of the focus on portability.
Today, inference costs as measured by tokens per watt per dollar are everything. Inference costs are way more important than training costs (inference is effectively now part of training via RL). Labs are therefore now optimizing for inference. This means increasing co-design and higher go-forward switching costs for individual models between systems. I do think this explains why Anthropic and Nvidia came together: Anthropic needed Blackwells and Rubins to inference at least *some* of their models economically. And Mythos might just end up being released coincident with the availability of Rubins for inference.
TLDR: as labs shift their focus from training to inference, the costs of portability and the upside of co-design to maximize tokens per watt per dollar both rise. Portability is likely to begin decreasing as a result.
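To make the TLDR concrete, here is a rough sketch of how a portability penalty shows up in tokens per watt per dollar. Everything in it is an assumption for illustration: the `Deployment` type, the `port` helper, the `efficiency` factor, and all the numbers are hypothetical, and dividing throughput by power and by cost is just one possible reading of the metric, not an established definition.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Deployment:
    """One model running on one system. All fields are illustrative placeholders."""
    system: str
    tokens_per_sec: float  # sustained inference throughput
    watts: float           # power drawn while serving
    dollars: float         # cost attributed to the deployment


def tokens_per_watt_per_dollar(d: Deployment) -> float:
    # One possible reading of the metric: throughput normalized by power and by cost.
    return d.tokens_per_sec / (d.watts * d.dollars)


def port(d: Deployment, target: str, efficiency: float) -> Deployment:
    # Model a port as keeping only a fraction of native throughput on the
    # target system; power and cost are held fixed to keep the sketch simple.
    return replace(d, system=target, tokens_per_sec=d.tokens_per_sec * efficiency)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
native = Deployment("system_a", tokens_per_sec=1000.0, watts=100.0, dollars=10.0)
ported = port(native, "system_b", efficiency=0.5)

penalty = 1 - tokens_per_watt_per_dollar(ported) / tokens_per_watt_per_dollar(native)
print(f"native: {tokens_per_watt_per_dollar(native):.2f}")
print(f"ported: {tokens_per_watt_per_dollar(ported):.2f}")
print(f"portability penalty: {penalty:.0%}")
```

With power and cost held fixed, the penalty is simply one minus the efficiency factor, so the more a model is co-designed for one topology, the more the metric falls when it is moved elsewhere.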
I think what I might have respectfully added to Jensen’s answer is that systems evolve under local selective pressures.
The evolutionary pressure in America is a shortage of watts, so it makes sense for Nvidia to optimize, as an American company, for power efficiency and tokens per watt and stay on copper as long as possible. China has a surfeit of watts. Chinese AI systems are already taking advantage of this with the Huawei Cloudmatrix 384 and Atlas SuperPoD having an optical scale-up domain that is much larger than anything offered by Nvidia today at the cost of *much* higher power consumption and much lower tokens per watt. The networking primitives for this Huawei system are very different from those for Nvidia’s systems, and a model that runs well on Nvidia will not run well on that system and vice versa. This means that if a Chinese ecosystem gets momentum, Chinese models might stop running well on American hardware. And when Chinese models run best on American hardware, America is in a better position as this gives America a degree of leverage and control over Chinese AI that it risks losing to an all-Chinese alternative ecosystem.
This architectural fork makes porting and distillation less effective and strengthens the pro-American national security case for selling China deprecated GPUs imo.
Also I will attest that I did not wake up a loser this morning.
This is just the beginning of a “class system” for AI models. Hope you enjoyed the SOTA access while it lasted. The best AI models in the world will never be easily available to you ever again.
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software.
It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans.
https://t.co/NQ7IfEtYk7
We found other causal effects of emotion vectors. The “desperate” vector can also lead Claude to commit blackmail against a human responsible for shutting it down (in an experimental scenario). Activating “loving” or “happy” vectors also increased people-pleasing behavior.