Anton McGonnell

@aton2006

Product @SambaNovaAI #GenerativeAI

Palo Alto

Joined August 2009

497 Following

576 Followers

711 Posts

Anton McGonnell

@aton2006

6 days ago

We are not trying to win in low interactivity use cases. Our battleground is in premium inference - the best throughput for high interactivity decoding on the biggest models. This is what is needed for agentic AI, and the price premiums are very large (Claude Opus fast mode is $150 per 1M output tokens and it is only around 159 tps/user!)
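As a rough illustration of the premium quoted above, the sketch below converts the two figures in the post ($150 per 1M output tokens, around 159 tokens/s per user) into a cost per hour of decoding for a single user. The function name and the assumption of uninterrupted decoding for the full hour are illustrative and not from the post.

```python
def cost_per_user_hour(price_per_million_usd: float, tokens_per_second: float) -> float:
    """Cost in USD of one hour of continuous decoding for a single user."""
    # Assumption: the user decodes without pause for the whole hour.
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / 1_000_000 * price_per_million_usd


# Figures quoted in the post: $150 per 1M output tokens at ~159 tokens/s per user.
cost = cost_per_user_hour(150.0, 159.0)
print(f"~${cost:.2f} per user-hour of continuous decoding")
```

Under these assumptions, one user decoding continuously at that speed generates roughly 572,400 tokens per hour, which is where the per-hour cost comes from.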

Anton McGonnell

@aton2006

6 days ago

I would say generally that Vikram’s analysis is all valid but I disagree that our biggest advantage is our large DDR. This has real utility for us for model bundling, but that requires a paradigm shift in how models are built. It also has utility today for newer sparse attention techniques and prompt caching, and this will continue over time as model labs use more data and model sparsity techniques. But our biggest advantages are 1) we can fuse all operations and all layers within a forward pass into one kernel call, which allows us to fully utilize our HBM bandwidth and 2) we can do this across many sockets, as we can perfectly hide communication behind compute. So we maintain this incredibly high HBM utilization across many chips using a combination of TP, EP and PP. This is something GPUs simply cannot do, so they are forced into these incredibly dense rack form factors and even then cannot push the envelope of decoding interactivity. Groq and Cerebras made everyone think you could only get this very high interactivity by putting everything in SRAM, but we have dispelled that myth, and our large memories allow us to have much higher concurrency and context lengths on the biggest models than they can support as a result, so we can push out the Pareto frontier for high interactivity use cases. The issues with SN40 are (1) that it was hard to keep up on prefill performance as we relied on BF16 compute and comparatively lower FLOPs than the latest GPUs and (2) the scale-up world size was 16 RDUs, which limited our ability to utilize our unique ability to scale up linearly, as we can overlap communication and compute, allowing us to push the frontier on interactivity while maintaining very good per-chip throughput. SN50 solves both of these problems. SN50 will have no equal for high interactivity decoding on the biggest models. We will prove this over the coming months.
And heterogeneous disaggregated inference means SN40 will also stay very competitive in the coming years by focusing on high interactivity decoding. I should add, the structural benefits of remaining air-cooled, with very low per-rack power consumption and weight, mean we can be deployed into most brownfield data centers today. This is huge for us and complements the other advantages above.

676

aton2006 retweeted

Pushkar Ranade

@magicsilicon

9 days ago

.@intel Announces New AI Innovations at Computex https://t.co/5g66xGhQRX

124

aton2006 retweeted

Intel

@intel

9 days ago

Intel and @SambaNovaAI are partnering to deliver rackscale AI infrastructure for inference and agentic workloads, built on Intel Xeon processors and SambaNova RDUs Watch the Computex keynote: https://t.co/oJltzhSuET

148

28K

Anton McGonnell

@aton2006

22 days ago

@tunguz You will never see this model anywhere because they need around 12 wafers just to run this at bs1. AA has not put it on the website for this reason. You can say you run big models but you need real evidence and this is not it.

Anton McGonnell

@aton2006

22 days ago

@mweinbach Per user or batched? Also, is this counting prefill and decode tokens or just decode?

Anton McGonnell

@aton2006

23 days ago

@insane_analyst @Alex_Intel_ @business @davorVDR @insane_analyst which results look good?

Anton McGonnell

@aton2006

25 days ago

@rwang07 Benchmark it on @SemiAnalysis_ then @cerebras

795

Anton McGonnell

@aton2006

27 days ago

Never has it been clearer that SambaNova has the best architecture in the world for AI inference. This will be proven to be true over the next 6 months.

347

Anton McGonnell

@aton2006

27 days ago

@StockSavvyShay It really doesn’t

aton2006 retweeted

SambaNova

@SambaNovaAI

about 1 month ago

Thank you @ArtificialAnlys for verifying our speeds, as always 🦾 @MiniMax_ai M2.7 is running the FASTEST on SambaCloud. Try it now: https://t.co/zm6RCXY00n

aton2006 retweeted

Intel

@intel

about 2 months ago

“We delivered robust Q1 results, reflecting the growing and essential role of the CPU in the AI era and unprecedented demand for silicon,” says Intel CFO @dzinsner. Read more about our business highlights from Q1 2026: https://t.co/tE8kwBdZvi

164

22K

Anton McGonnell

@aton2006

about 2 months ago

@Luigi1549898 @AlphaSenseInc I can assure you it is wrong, the interviewee does not understand the systems he/she is talking about, nor their tokenomics.

Anton McGonnell

@aton2006

2 months ago

Agreed - you want fungibility, especially given the diversity of workloads and model sizes in AI inference. But hyper-specialized chips like the Groq LPU, which can only do decoding FFN somewhat efficiently, are not the same as SambaNova’s RDU, which can do prefill well and ultrafast decoding extremely well. Coupling RDUs with other general-purpose chips like GPUs, which are good at prefill and high-throughput, high-latency inference, means you still have lots of fungibility, but you get to specialize the majority of your traffic.

204

Anton McGonnell

@aton2006

2 months ago

@tunguz You could not be more wrong

Anton McGonnell

@aton2006

2 months ago

Much wrong with your thesis here:
1. Nvidia does not sell one rack; the VR, Vera CPU and Groq LPU are all separate racks, and all should scale independently, as we are proposing. You want loose coupling.
2. The VR+LPU system is a MUCH more Frankenstein system than what we propose. Separating attention and FFN is a very bad architectural decision driven by the limitations of the GPU and LPU. This will cause huge utilization issues, and be much harder to program.
3. SambaNova SN50 has more interconnect bw (2.2TB/s) across a larger world size (256 or even up to 512) than Nvidia. KV cache transfer generally speaking is not the problem. And the LPU rack is not integrated into NVLink.
4. Open standards like RoCE are all that is needed for inference, but nothing is stopping SambaNova, Intel or any other vendor from plugging into NVLink Fusion when needed.
5. SambaNova RDU is nothing like Groq’s architecture. SambaNova was founded in 2017 (a year after Groq), has shipped 5 generations of systems, and I am quite sure has more revenue than Groq. SambaNova is also not an SRAM-only system: it has HBM and directly connected DDR, a much larger memory than any other chip, including Nvidia GPUs. You should spend more time understanding SambaNova’s architecture before making these kinds of statements.
6. Intel Xeons can run all software that agents are interacting with; Vera does not have anywhere close to the same software coverage.
In general, Nvidia has no solution to low latency agentic inference. The Groq approach will not work, and time will prove this.

Anton McGonnell

@aton2006

2 months ago

What is so different about agentic CPU workloads versus non-agentic software? Nothing has changed other than the same loads run much more often. It is still lots of database reads, compiles, automated tests etc. In fact the diversity of tasks is so vast that the most important thing is workload coverage. x86 dominates from this perspective. Investing in a Vera CPU rack only for every SAP HANA call to fail because ARM doesn’t support it seems like a bad deal.

Anton McGonnell

@aton2006

3 months ago

We agree with Jensen on one thing: The future is fast, disaggregated inference. But the path there matters. Nvidia has proposed a fragmented, multi-chip system with strong benchmarks, but it will be inefficient, costly and complicated to deploy. We take a cleaner approach: Dataflow architecture. Fewer chips. High efficiency. Deployable. No Frankenstack. It’s clear: GPUs for prefill. RDUs for decode. @SambaNovaAI

181

Anton McGonnell

@aton2006

3 months ago

Someone needs to explain to me how this “magic deterministic compiler” is supposed to handle the inherent dynamism in the most common forms of sparsity. I.e. in MoEs, you don’t know at compile time which token activates which parameters. So are you activating all parameters? This is incredibly wasteful if so. And we are going to see more things like this, e.g. DSA makes KV cache sparse and therefore dynamic. The only super fast inference chip that makes sense is @SambaNovaAI.
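To illustrate the dynamism described above, here is a minimal sketch of top-k MoE routing in plain Python. The expert count, the value of k, and the random router scores are hypothetical stand-ins; the point is only that the set of active experts is computed from per-token scores at runtime, so it cannot be fixed at compile time.

```python
import random


def top_k_experts(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


def route_batch(batch_scores, k):
    """Route every token in a batch; each token may pick different experts."""
    return [top_k_experts(scores, k) for scores in batch_scores]


NUM_EXPERTS = 8  # hypothetical expert count
TOP_K = 2        # hypothetical number of experts activated per token

# Random scores stand in for a learned router's output on real tokens.
rng = random.Random(0)
batch_scores = [[rng.random() for _ in range(NUM_EXPERTS)] for _ in range(4)]

assignments = route_batch(batch_scores, TOP_K)
active = sorted({e for token in assignments for e in token})

print("experts chosen per token:", assignments)
print(f"{len(active)} of {NUM_EXPERTS} experts touched in this step")
```

Changing the seed, or the input tokens in a real model, changes which expert weights are needed, which is the scheduling problem the post is pointing at.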

Anton McGonnell

@aton2006

4 months ago

@ryanshrout @SambaNovaAI @ryanshrout happy to spend some time walking you through the details of SambaNova’s RDUs and what makes them special, plus justifying the numbers we disclosed today.

101
