New @OracleCloud Infrastructure benchmarks show NeuralMesh with Augmented Memory Grid delivered:
⚡ 10x more concurrent users
⚡ 10x higher token throughput
⚡ 7x more tokens served per GPU
Take a closer look: https://t.co/0efFUSRCv1
It's exciting to see what YTL AI Cloud is building for Malaysia’s sovereign AI future.
With @WEKA's NeuralMesh, they’re scaling secure, high-performance infrastructure to power next-gen AI, including the country’s first locally developed LLM. https://t.co/oSflqalBHX
If you’re tired of Claude summarizing your convo, all that is about to change when Context Memory Storage comes online.
Agentic AI requires an enormous amount of context and KV cache.
If LLMs constantly get amnesia, it’s hard to get anything done.
Now we can add petabytes of storage for AI to remember EVERYTHING, and this changes tokenomics going forward.
Special thanks to @AccBalanced for in-depth conversations about context memory storage. This article would not have been possible otherwise.
Read all about it here:
https://t.co/01lVSDrJqZ
The economics of AI has been a big question mark in many investors' minds - What does the value chain look like? How do you model out the ROIC of AI? What would the ROIC look like?
We built up an end-to-end economics stack to answer this question - how we go from a chip’s silicon cost, through full system integration, all the way down to the dollar cost per million inference tokens. (1/4) 🧵
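A minimal sketch of the last step in a stack like this, going from an all-in system cost to dollars per million tokens. This is not the thread's actual model: the function name, the straight-line amortization, and every input value below are illustrative assumptions.

```python
def cost_per_million_tokens(
    system_cost_usd: float,      # assumed all-in cost of the deployed system
    amortization_years: float,   # assumed straight-line depreciation period
    power_kw: float,             # assumed average power draw of the system
    usd_per_kwh: float,          # assumed electricity price
    tokens_per_second: float,    # assumed peak inference throughput
    utilization: float,          # assumed fraction of time serving tokens (0..1)
) -> float:
    """Return the amortized hardware plus power cost per one million tokens."""
    hours = amortization_years * 365 * 24
    capex_per_hour = system_cost_usd / hours
    power_per_hour = power_kw * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (capex_per_hour + power_per_hour) / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to show the call shape.
example = cost_per_million_tokens(
    system_cost_usd=87_600,
    amortization_years=1,
    power_kw=10,
    usd_per_kwh=0.1,
    tokens_per_second=1_000,
    utilization=0.5,
)
print(f"${example:.2f} per million tokens")
```

Throughput and utilization sit in the denominator, so doubling either one halves the cost per token at fixed capex and power.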
Yesterday, @RealJimChanos posited that Tesla’s relatively low capex meant that they were not a serious competitor in real world AI and Robotics.
This is *exactly* the wrong way to look at it and the implications of this fact are actually positive for Tesla IMO.
Tesla’s inference definitionally happens in the car so their customers are effectively paying for the inference compute “capex,” which is now probably the majority of hyperscaler capex spend.
Tesla’s capex might be an order of magnitude higher if they had to synthetically generate relevant driving data in a datacenter. Customer subsidized vertical integration is beautiful.
This is also why at some point Tesla customers will be able to put their cars into a pool of distributed edge compute and earn money when the car is not driving - same way that Akamai and Cloudflare are putting single GPUs in their edge nodes.
The Tesla fleet as the world’s largest, most distributed CDN for AI (and only AI, as you obviously can’t cache content in cars) is a real possibility. BYD will have a similar opportunity and similar inference cost advantage.
Beyond this significant inference cost advantage, Tesla has the second largest coherent Hopper cluster - behind only xAI - in the world for pre-training. You only need one coherent cluster *if* it is large enough. Coherent cluster size drives capital efficiency for pre-training.
No one has been able to match the xAI and Tesla clusters from a coherence, speed and cost perspective with coherence being the most important. This is why Jensen described their datacenter design and execution as “superhuman.” Should note that Tesla also has an AI4 cluster for post-training or mid-training or whatever we are calling it these days.
Tesla also has a significant data advantage for training Chinchilla optimal FSD models as real world video scales infinitely and this data advantage further lowers their capitalized training cost - less synthetic data generation and 3P data sourcing/labeling vs. labs training LLMs.
This relative capital efficiency as a result of all these advantages - the largest coherent cluster, customers paying for inference, dataset size and ongoing data generation cost - is likely to matter vs. Robotics and FSD competitors who are less capital efficient.
Cost per token is everything for AI. Google is the low cost producer of LLM tokens (with xAI as #2) but Tesla is the lowest cost producer of tokens that matter for FSD and Robotics.
AI is the first time in my career that being the low cost producer has mattered as token quantity effectively drives quality in a reasoning world. I think this dynamic is very underappreciated by the market.
Tesla might very well be outcompeted by an FSD competitor - unlikely from my perspective but anything is possible - but this will not happen because of their relative capex spend.
If LLM inference happened at the edge on phones and PCs as with FSD, hyperscaler capex would be *much* lower. This is the real risk to datacenter spending, not all the value/macro takes. Btw - memory is the biggest winner in this scenario which is years out if scaling laws continue to hold.
Jim is a smart guy but I humbly think his AI takes are misinformed.
Also so strange to me that anyone is focused on AI as a bubble given the extremely obvious quantum and nuclear bubbles where there are loads of equities that can decline 99% and still be overvalued.
“A lot of attention is given to compute, memory and networking in an AI data center.
What gets less attention is the design of high capacity storage for AI workloads.”
How to turn your AI infrastructure from a cost center into a profit center
It’s about leverage in your data infrastructure at the @weka storage and memory layers, to radically maximize token unit economics:
https://t.co/DkS6YfKMvr
🚨 Live from #RAISESummit: WEKA unveils NeuralMesh Axon—breakthrough storage for exascale #AI.
⚡ 10x faster checkpointing
⚡ 20x faster time-to-first-token
📈 90%+ GPU utilization
Built for LLMs, agentic AI & real-time inference. https://t.co/BwB1Nl9G37
Given the massive - and increasing - importance of test-time compute and post-training RL shown by Grok-4’s absolute dominance, being the low cost producer of tokens is more important than ever. As an aside, this is the first time in my career as a tech investor that being the low cost producer of anything has mattered.
Today, the lowest cost producers of tokens are Google (TPUs) and xAI (largest coherent cluster, lowest capex $ per deployed GPU, almost certainly highest MFU and have made some really smart architectural decisions). I am obviously biased when it comes to xAI.
From a solely technical perspective, having the best scale-up networking and most efficient KV cache offload are most important to both cost and latency for the increasingly large models and context windows. These are the most important axes of competition in AI infrastructure today - not compute. Note that on-package memory bandwidth is most important when you can fit the model on a single chip (@cerebras) but for any really large model that requires multiple packages, scale-up and KV cache offload are most important, as everyone working on ASICs is slowly beginning to understand.
This is why Dynamo and open-sourcing NVLink were both important and smart. The latter could increasingly lead to ASIC share migrating to NVLink partners. Not to mention the natural negotiating benefits of having a second supplier. Likely to see more of these IMHO:
NAND Research looks at WEKA's NeuralMesh, a new AI-native storage architecture built to address the performance, elasticity, and latency demands of real-time inference and agentic AI:
https://t.co/HJZg68tqFk
WEKA just open-sourced the GPUDirect Storage (GDS) integration from its Augmented Memory Grid - now available for the vLLM and LMCache frameworks.
The combination lets you cut TTFT by 20x, and extend KV Cache TTL from an hour to weeks.
WEKA is excited to share this with the open-source community and would love your feedback. Join the conversation in the new #WEKA-GDS-Integration channel in the vLLM Slack.
https://t.co/YQYIlgVwNW
Oracle Cloud (OCI) saw TTFT for Llama3.1-70B drop from 39 seconds to 2 seconds by using Augmented Memory Grid to extend KV Cache.
https://t.co/I07TNDu9GM
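A rough back-of-the-envelope sketch of why KV cache for a model this size outgrows GPU memory, and how the reported TTFT numbers relate to the "20x" figure quoted elsewhere. The model shape (80 layers, 8 KV heads, head dimension 128), the 16-bit cache precision, and the 128K-token context are assumptions taken from the publicly listed Llama 3.1 70B configuration, not from the OCI benchmark itself.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumed fp16/bf16 cache entries
) -> int:
    """Bytes of KV cache stored per token: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 70B shape (grouped-query attention with 8 KV heads).
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Assumed single full-length context of 128K tokens.
context_tokens = 128 * 1024
total_gib = per_token * context_tokens / 2**30

# TTFT figures as reported in the post: 39 seconds down to 2 seconds.
speedup = 39 / 2

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total_gib:.0f} GiB for one {context_tokens}-token context")
print(f"{speedup:.1f}x TTFT improvement")
```

Under these assumptions a single long context takes tens of GiB of cache, which is why recomputing or evicting it is costly and why extending the cache beyond GPU memory can cut time to first token.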
Frustrated with LLM inference latency and token efficiency?
Here's a way to dramatically speed up inference using a KV Cache extension -
- Speeds up Time To First Token by 20x
- Also allows significantly higher token throughput per GPU
https://t.co/ZfBRr3offS
🧠 Not all KV cache solutions are equal — and the difference is critical.
@vLLM_project & @NVIDIA Project Dynamo are setting a new bar. Augmented memory (WEKA) ≠ commodity storage.
Here’s how WEKA delivers memory-class speed for next-gen inference 👇
https://t.co/gVEGNTkvDF