We are not trying to win in low interactivity use cases. Our battleground is in premium inference - the best throughput for high interactivity decoding on the biggest models. This is what is needed for agentic AI, and the price premiums are very large (Claude Opus fast mode is $150 per 1M output tokens and it is only around 159 tps/user!)
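As a rough back-of-the-envelope check on that premium, here is a small sketch that uses only the two figures quoted above. The assumption that a single user stream decodes continuously for a full hour is mine, as is the function name.

```python
# Figures quoted in the post above.
PRICE_PER_MILLION_OUTPUT_TOKENS = 150.0  # USD per 1M output tokens
TOKENS_PER_SECOND_PER_USER = 159.0       # tps/user


def revenue_per_stream_hour(price_per_million: float, tps: float) -> float:
    """USD billed per hour for one stream, assuming it decodes non-stop."""
    tokens_per_hour = tps * 3600
    return tokens_per_hour * price_per_million / 1_000_000


usd = revenue_per_stream_hour(
    PRICE_PER_MILLION_OUTPUT_TOKENS, TOKENS_PER_SECOND_PER_USER
)
print(f"${usd:.2f} per stream-hour")
```

Under that assumption, a single continuously decoding stream bills roughly $86 per hour.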
I would say generally that Vikram’s analysis is all valid but I disagree that our biggest advantage is our large DDR. This has real utility for us for model bundling, but that requires a paradigm shift in how models are built. It also has utility today for newer sparse attention techniques and prompt caching, and this will continue over time as model labs use more data and model sparsity techniques.
But our biggest advantages are 1) we can fuse all operations and all layers within a forward pass into one kernel call, which allows us to fully utilize our HBM bandwidth and 2) we can do this across many sockets, as we can perfectly hide communication behind compute. So we maintain this incredibly high HBM utilization across many chips using a combination of TP, EP and PP. This is something GPUs simply cannot do, so they are forced into these incredibly dense rack form factors and even then cannot push the envelope of decoding interactivity. Groq and Cerebras made everyone think you could only get this very high interactivity by putting everything in SRAM, but we have dispelled that myth, and our large memories allow us to have much higher concurrency and context lengths on the biggest models than they can support as a result, so we can push out the Pareto frontier for high interactivity use cases.
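To make the HBM-utilization argument concrete, here is a minimal sketch of a bandwidth-bound decode model. It assumes batch size 1, that reading the active weights once per token dominates step time, and that KV cache reads and compute are negligible. Every number is a made-up placeholder, not an SN40, SN50, or GPU specification.

```python
def decode_tokens_per_second(
    active_bytes_per_token: float,
    hbm_bandwidth_per_chip: float,
    num_chips: int,
    utilization: float,
) -> float:
    """Upper bound on per-user decode speed when weight reads dominate.

    Simplified model (assumption): each token needs one full read of the
    active weights, split evenly across chips.
    """
    effective_bandwidth = hbm_bandwidth_per_chip * num_chips * utilization
    return effective_bandwidth / active_bytes_per_token


# Hypothetical placeholders: 100 GB of active weights, 2 TB/s HBM per chip.
ACTIVE_BYTES = 100e9
HBM_BW = 2e12

for util in (0.3, 0.9):
    tps = decode_tokens_per_second(ACTIVE_BYTES, HBM_BW, num_chips=16, utilization=util)
    print(f"utilization {util:.0%}: about {tps:.0f} tokens/s per user")
```

In this toy model, per-user speed scales directly with achieved bandwidth utilization and chip count, which is why keeping utilization high across many sockets matters for interactivity.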
The issues with SN40 are (1) that it was hard to keep up on prefill performance, as we relied on BF16 compute and had comparatively lower FLOPS than the latest GPUs, and (2) the scale-up world size was 16 RDUs, which limited our ability to utilize our unique ability to scale up linearly, as we can overlap communication and compute, allowing us to push the frontier on interactivity while maintaining very good per-chip throughput. SN50 solves both of these problems. SN50 will have no equal for high interactivity decoding on the biggest models. We will prove this over the coming months. And heterogeneous disaggregated inference means SN40 will also stay very competitive in the coming years by focusing on high interactivity decoding.
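Why overlapping communication with compute gives linear scale-up can be sketched with a toy step-time model. This is an illustration under stated assumptions, not a description of the actual RDU runtime: work divides evenly across chips, communication time per step is a fixed constant, and all timings are hypothetical.

```python
def step_time(compute_s: float, comm_s: float, overlap: bool) -> float:
    """Time for one decode step on one chip.

    With overlap, communication hides behind compute, so the step takes
    the longer of the two. Without overlap, the two are serialized.
    """
    return max(compute_s, comm_s) if overlap else compute_s + comm_s


def speedup(num_chips: int, single_chip_compute_s: float, comm_s: float, overlap: bool) -> float:
    """Speedup over one chip, assuming work divides evenly (assumption)."""
    per_chip_compute = single_chip_compute_s / num_chips
    return single_chip_compute_s / step_time(per_chip_compute, comm_s, overlap)


# Hypothetical timings: 16 ms of compute on one chip, 0.5 ms of communication.
for overlap in (True, False):
    s = speedup(16, single_chip_compute_s=0.016, comm_s=0.0005, overlap=overlap)
    print(f"overlap={overlap}: {s:.2f}x on 16 chips")
```

As long as communication stays shorter than per-chip compute, the overlapped case scales linearly with chip count, while the serialized case falls further behind as chips are added.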
I should add, the structural benefits of remaining air-cooled, with very low per-rack power consumption and weight, mean we can be deployed into most brownfield data centers today. This is huge for us and complements the other advantages above.
Intel and @SambaNovaAI are partnering to deliver rackscale AI infrastructure for inference and agentic workloads, built on Intel Xeon processors and SambaNova RDUs
Watch the Computex keynote: https://t.co/oJltzhSuET
@tunguz You will never see this model anywhere because they need around 12 wafers just to run this at bs1. AA has not put it on the website for this reason. You can say you run big models but you need real evidence and this is not it.
Never has it been clearer that SambaNova has the best architecture in the world for AI inference. This will be proven to be true over the next 6 months.
Thank you @ArtificialAnlys for verifying our speeds, as always 🦾
@MiniMax_ai M2.7 is running the FASTEST on SambaCloud. Try it now: https://t.co/zm6RCXY00n
“We delivered robust Q1 results, reflecting the growing and essential role of the CPU in the AI era and unprecedented demand for silicon,” says Intel CFO @dzinsner. Read more about our business highlights from Q1 2026: https://t.co/tE8kwBdZvi
@Luigi1549898 @AlphaSenseInc I can assure you it is wrong, the interviewee does not understand the systems he/she is talking about, nor their tokenomics.
Agreed - you want fungibility, especially given the diversity of workloads and model sizes in AI inference. But hyper-specialized chips like Groq LPU, which can only do decoding FFN somewhat efficiently, are not the same as SambaNova’s RDU, which can do prefill well and ultrafast decoding extremely well. Coupling RDUs with other general purpose chips like GPUs, which are good at prefill and high throughput, high latency inference, means you still have lots of fungibility, but you get to specialize the majority of your traffic.
Much wrong with your thesis here:
1. Nvidia does not sell one rack, the VR, Vera CPU and Groq LPU are all separate racks, and all should scale independently, as we are proposing. You want loose coupling.
2. The VR+LPU system is MUCH more of a Frankenstein system than what we propose, separating attention and FFN is a very bad architectural decision driven by the limitations of the GPU and LPU. This will cause huge utilization issues, and be much harder to program.
3. SambaNova SN50 has more interconnect bw (2.2TB/s) across a larger world size (256 or even up to 512) than Nvidia. KV cache transfer generally speaking is not the problem. And the LPU rack is not integrated into NVLink.
4. Open standards like RoCE are all that is needed for inference, but nothing is stopping SambaNova, Intel or any other vendor from plugging into NVLink Fusion when needed.
5. SambaNova RDU is nothing like Groq’s architecture. SambaNova was founded in 2017 (a year after Groq), has shipped 5 generations of systems and I am quite sure has more revenue than Groq. SambaNova is also not an SRAM-only system, it has HBM and directly connected DDR, much larger memory than any other chip, including Nvidia GPUs. You should spend more time understanding SambaNova’s architecture before making these kinds of statements.
6. Intel Xeons can run all software that agents are interacting with, Vera does not have anywhere close to the same software coverage.
In general, Nvidia has no solution to low latency agentic inference. The Groq approach will not work, and time will prove this.
What is so different about agentic CPU workloads versus non-agentic software? Nothing has changed other than the same loads run much more often. It is still lots of database reads, compiles, automated tests etc. In fact the diversity of tasks is so vast that the most important thing is workload coverage. x86 dominates from this perspective. Investing in a Vera CPU rack only for every SAP HANA call to fail because ARM doesn’t support it seems like a bad deal.
We agree with Jensen on one thing: The future is fast, disaggregated inference.
But the path there matters. Nvidia has proposed a fragmented, multi-chip system with strong benchmarks, but it will be inefficient, costly and complicated to deploy.
We take a cleaner approach: Dataflow architecture. Fewer chips. High efficiency. Deployable.
No Frankenstack.
It’s clear: GPUs for prefill. RDUs for decode.
@SambaNovaAI
Someone needs to explain to me how this “magic deterministic compiler” is supposed to handle the inherent dynamism in the most common forms of sparsity.
I.e. in MoEs, you don’t know at compile time which token activates which parameters. So are you activating all parameters? This is incredibly wasteful if so. And we are going to see more things like this, e.g. DSA makes KV cache sparse and therefore dynamic.
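The dynamism in question can be illustrated with a minimal top-k routing sketch. It is a generic toy router with random logits standing in for a learned gating network, not any particular model's or vendor's implementation; the expert count and k are arbitrary.

```python
import random


def top_k_experts(router_logits: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical setup: 8 experts, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
random.seed(0)

for token_id in range(4):
    # In a real MoE these logits come from the token's hidden state,
    # so they are only known at run time.
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    print(f"token {token_id} -> experts {top_k_experts(logits, TOP_K)}")
```

Because the selected experts depend on each token's data, the set of parameters touched per step cannot be fixed in a static schedule without either handling the routing dynamically or computing every expert.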
The only super fast inference chip that makes sense is @SambaNovaAI .
@ryanshrout @SambaNovaAI @ryanshrout happy to spend some time walking you through the details of SambaNova’s RDUs and what makes them special, plus justifying the numbers we disclosed today.