Robert Hormuth

@rhormuth

Corporate Vice President, Architecture and Strategy, Data Center & Embedded Solutions Business Group at AMD

The Great State of Texas

Joined March 2009

507 Following

500 Followers

976 Posts

rhormuth retweeted

Dylan Patel

@dylan522p

over 1 year ago · Georgia

Met with @LisaSu today for 1.5 hours as we went through everything She acknowledged the gaps in AMD software stack She took our specific recommendations seriously She asked her team and us a lot of questions Many changes are in flight already! Excited to see improvements coming

193

622

597K

rhormuth retweeted

Lisa Su

@LisaSu

over 1 year ago

Thanks @dylan522p for the constructive conversation today. Feedback is a gift even when it’s critical. We have put a ton of work into customer and workload optimizations but there is lots more we can do to enable the broad ecosystem. I appreciate all the feedback and desire to engage with @AMD. We are committed to building a world-class open software stack. Lots planned for 2025. Happy holidays to all!

135

106K

rhormuth retweeted

Jack Huynh

@jackhuynh

over 1 year ago

Excited to announce the next chapter in gaming greatness! The world’s best gaming processor, the @AMD Ryzen 7 9800X3D is here to power our community to new heights. Let’s level up together! 🚀🎮

530

58K

rhormuth retweeted

Yangqing Jia

@jiayq

almost 2 years ago

Memory Matters for LLM. While everyone is rushing to provide the serverless Llama3-405b model, I want to talk about one key choice that matters a lot, especially for dedicated enterprise deployments when traffic is not very high: memory. - The normal deployment of a model the size of 405b takes 8xH100 GPUs with a total of 640G memory. You'll quantize the weights to int8 or fp8, leaving about 230G memory for KV cache and others. Doable with care. - If you need to do fine-tuning (full fine-tuning, or Lora, or Medusa), memory size is going to be stressful. Your choices are probably (1) do quantized training with careful control of scale, (2) go distributed, both require extra care. - @AMD MI-300 is a particularly interesting card for this scenario, as each card has 192G memory - 4 cards with a total of 768G memory will be very comfortably host the model, as well as giving you a good amount of remaining memory for KV cacheing / prompt cacheing and other tricks. - Attached is a screenshot showing our runtime ("tuna") running the 405b model on 4xMI300 out of the box at @LeptonAI. Speed is good. - We know there are a lot of claims out there saying one is faster than the other, but based on our experience, with reasonable quantization, continuous batching, chunked decoding and other known optimization techniques, MI300 and H100 exhibit on-par performance. - We haven't thoroughly tested some of the optimization techniques, such as Medusa, on the 405b models. So it is hard to say for sure which GPU takes the lead. - The upcoming Blackwell GPUs will have 192G memory as well, so we are definitely seeing appetite for larger models. - Large memory definitely gives you opportunity to do more within one box: 1.536TB memory per machine means you can do almost whatever you want with the 405b sized models: fine-tune them, serve multiple models at once, hot swap Loras, etc. It's exciting times for model, and also exciting times for infra.

jiayq's tweet photo. Memory Matters for LLM.

While everyone is rushing to provide the serverless Llama3-405b model, I want to talk about one key choice that matters a lot, especially for dedicated enterprise deployments when traffic is not very high: memory.

- The normal deployment of a model the size of 405b takes 8xH100 GPUs with a total of 640G memory. You'll quantize the weights to int8 or fp8, leaving about 230G memory for KV cache and others. Doable with care.

- If you need to do fine-tuning (full fine-tuning, or Lora, or Medusa), memory size is going to be stressful. Your choices are probably (1) do quantized training with careful control of scale, (2) go distributed, both require extra care.

- @AMD MI-300 is a particularly interesting card for this scenario, as each card has 192G memory - 4 cards with a total of 768G memory will be very comfortably host the model, as well as giving you a good amount of remaining memory for KV cacheing / prompt cacheing and other tricks.

- Attached is a screenshot showing our runtime ("tuna") running the 405b model on 4xMI300 out of the box at @LeptonAI. Speed is good.

- We know there are a lot of claims out there saying one is faster than the other, but based on our experience, with reasonable quantization, continuous batching, chunked decoding and other known optimization techniques, MI300 and H100 exhibit on-par performance.

- We haven't thoroughly tested some of the optimization techniques, such as Medusa, on the 405b models. So it is hard to say for sure which GPU takes the lead.

- The upcoming Blackwell GPUs will have 192G memory as well, so we are definitely seeing appetite for larger models.

- Large memory definitely gives you opportunity to do more within one box: 1.536TB memory per machine means you can do almost whatever you want with the 405b sized models: fine-tune them, serve multiple models at once, hot swap Loras, etc.

It's exciting times for model, and also exciting times for infra.

257

142

51K

Who to follow

Patrick J Kennedy

@Patrick1Kennedy

I do tech things @ServeTheHome Chief Analyst for https://t.co/GA7x3f8Ac9 https://t.co/7sCknLYWql and the Axautik Group LLC

Nemez

@GPUsAreMagic

Helping make GPUs faster, hobby semicon/FPGAs and too much Minecraft. Opinions are my own, nowadays I mostly rant. Also on https://t.co/dulcLVhvfX

AMD Networking

@AMDNetworking

Official Twitter handle for AMD networking solutions.

rhormuth retweeted

Michael Dell 🇺🇸

@MichaelDell

about 2 years ago

Multi cloud including on prem and collocation is the clear trend. Somewhat driven by the growth in AI inference and data gravity, 83% of enterprise CIOs in Barclays survey plan to repatriate at least some workloads in 2024, up from low point of 43% in 2020 H2.

MichaelDell's tweet photo. Multi cloud including on prem and collocation is the clear trend. Somewhat driven by the growth in AI inference and data gravity, 83% of enterprise CIOs in Barclays survey plan to repatriate at least some workloads in 2024, up from low point of 43% in 2020 H2. https://t.co/f6yLK2IVKt

356

75K

rhormuth retweeted

Patrick Moorhead

@PatrickMoorhead

over 2 years ago · San Jose

Now @DellTech with President, GIS (DC and edge), Arthur Lewis: -PowerEdge adding MI300X -8 cards, some wild bandwidth -Using Infinity Fabric too (interesting) - 'focus on out of the box LM simplicity' -'Open for business, taking orders now' (crowd cheers) $DELL

PatrickMoorhead's tweet photo. Now @DellTech with President, GIS (DC and edge), Arthur Lewis:
-PowerEdge adding MI300X
-8 cards, some wild bandwidth
-Using Infinity Fabric too (interesting)
- 'focus on out of the box LM simplicity'
-'Open for business, taking orders now' (crowd cheers)
$DELL https://t.co/4LBBwiQ2Wt

rhormuth retweeted

Karl Freund

@karlfreund

over 2 years ago

@AMD and @Dell say that 83% of data today resides on prem. Makes sense. Dell will support the MI300X on its servers. We think the on-prem enterprise opportunity for AMD MI300 may be greater than the cloud.

karlfreund's tweet photo. @AMD and @Dell say that 83% of data today resides on prem. Makes sense. Dell will support the MI300X on its servers. We think the on-prem enterprise opportunity for AMD MI300 may be greater than the cloud. https://t.co/HuoZ3FqBtL

786

rhormuth retweeted

Ultra Ethernet Consortium @ultraethernet

over 2 years ago

Exciting news 🎉 Ultra Ethernet Consortium (UEC) opens doors to new member applications & new UEC memberships will begin on November 15, 2023! Read our newsletter for more info: https://t.co/YQ2TPSpPiJ #UltraEthernet

433

Robert Hormuth @rhormuth

almost 3 years ago

Given the other guys news….AMD has EPYC and Ryzen options… #amd #epyc #ryzen #embedded #togetherweadvance “Intel Exiting the PC Business as it Stops Investment in the Intel NUC” https://t.co/MsZToxwcUk https://t.co/9KsWdPAFl1

238

Robert Hormuth @rhormuth

almost 3 years ago

Come get a peek under the covers. https://t.co/aO3gO8Ki2l

134

Robert Hormuth @rhormuth

almost 3 years ago

It's been a true honor to join AMD and the leadership of Dr Lisa Su. #leadership #AMD #togetherweadvance https://t.co/tY9scw8SHV

200

Robert Hormuth @rhormuth

almost 3 years ago

New ROCm™ 5.6 Release Brings Enhancements and Optimizations for AI and HPC Workloads AMD to Add ROCm Support on Select RDNA™ 3 GPUs this Fall #amd #ROCm #instinct #radeon #togetherweadvance #ai #ml #inference https://t.co/d3TFOnQHoG

289

Robert Hormuth @rhormuth

almost 3 years ago

Innovation continues, even in refrigerators!!!! Love my new LG, it’s EPYC!!!! Jay Boisseau Jay Kirkland Paul Perez @drjmetz Jaime Edwards Chuck Gilbert @SR_fellow #innovation #LG https://t.co/lxYHgnnLvD

106

Robert Hormuth @rhormuth

almost 3 years ago

With the release of PyTorch 2.0 and ROCm 5.4, we are excited to announce that LLM training works out of the box on AMD datacenter GPUs, with zero code changes, and at high performance (144 TFLOP/s/GPU)! We are thrilled to see promi…https://t.co/GNLN7yVEBa https://t.co/FV0lF15F6L

Robert Hormuth @rhormuth

almost 3 years ago

From Exascale (Frontier) to Exadata (Oracle X10M), AMD provides Exa-Results. Up to 3X more transaction throughput, 3.6X faster analytics, and 50 percent more consolidation compared to previous generation "Available now in both O…https://t.co/tu1kxJj0Yv https://t.co/Sh8VQhdz8R

Robert Hormuth @rhormuth

almost 3 years ago

Please congratulate Paul Artman on his promotion to AMD Fellow. Honored to have Paul on my team.

Robert Hormuth @rhormuth

almost 3 years ago

Anyone remember the theoretical memory BW/Core chart from a few months ago. Well here is a measured version with interfering actors. Showing the power of the EPYC Zen4 chiplet architecture in action. Incredible difference in the shape of the graphs.…https://t.co/rBN1D0rC9O

366

Robert Hormuth @rhormuth

almost 3 years ago

OCP Server Community – On June 21st, the DC-MHS group will have a Public Meeting starting at 0800 Pacific Time. https://t.co/UY4oBs4tdF The meeting will be consist of two sections: Between 0800 and 0830, there will be a recorded call for general updat…https://t.co/BaVuBghYqs

100

Robert Hormuth @rhormuth

almost 3 years ago

"The best defense is a good offense, and as it turns out, the best offense is also a good offense. So while AMD is all polite-like in its presentations, rest assured that with the ever enwidening and embiggening Epyc server chip li…https://t.co/JYbEUXDobF https://t.co/AvV4J4q1bl

11K

Robert Hormuth @rhormuth

almost 3 years ago

Learn more at: https://t.co/pmw3l9I0tn #amd #epyc #togetherweadvance #sustainability #greenIT #energyefficiency https://t.co/P7uzju1mjg

103

Robert Hormuth

@rhormuth

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users