Alexey Tumanov

5 months ago

LLM inference is quickly becoming a global infrastructure problem. Leveraging AI agents to optimize these systems is the natural way to accelerate their development. But AI-driven optimization requires a fast, cheap, and accurate evaluation mechanism. 🧵


5 months ago

Academic credit for chunked prefill keeps getting missed by those reimplementing it in their own systems. https://t.co/uJAnUx1iXO demonstrated the superiority of the chunked prefill mechanism for enabling preemptive scheduling of long-context requests (up to 10M tokens).

5 months ago

Chunk pipeline parallelism provides two critical advantages:

1. It scales ridiculously well. You can get 85%+ efficiency even at pp=32, 128 H100s!
2. It supports preemption, so you avoid the terrible convoy effect that happens with long context. 10x+ improvement over CP 1/2 https://t.co/G68J0cq1nS


alsched retweeted

Asst. Prof @utexasece, Past: Researcher @VMwareResearch, Postdoc @Stanford CS, PhD @Berkeley_EECS I work on cloud computing systems and ML.

11 months ago

After hitting evaluation puzzles like this in our own work, we analyzed patterns across LLM inference papers and identified 8 systematic evaluation issues that can make performance comparisons misleading. We have compiled a practical evaluation checklist to help avoid these pitfalls. 📄 https://t.co/KV6V24JmLR We're also releasing Veeksha, our comprehensive LLM inference evaluation framework, later this month to help the community design more robust benchmarks! 🛠️ What evaluation issues have you discovered in your systems work? Let's learn from each other's mistakes! @nitinkedi @jayashree2912 @kwatra @thisissouvikk @ramaramjee @alsched @gtcomputing @MSFTResearch @intel



alsched retweeted

Georgia Tech School of Computer Science @gatech_scs

11 months ago

Interesting work on long-context inference from @nvidia, where they scale KV parallelism on gb200-nvl72 systems! To learn more about accelerating long-context inference and the trade-offs between different parallelism dimensions, check out our paper, Medha: https://t.co/kcHkvNWy6q


alsched retweeted

11 months ago

Congratulations 👏 to our faculty who were recognized on the Spring 2025 CIOS Honor Roll for their outstanding teaching and educational impact: Assoc. Prof. Alexey Tumanov and Asst. Prof. Jan Van Den Brand!


alsched retweeted

Sachit Kuhar @SachitKuhar

about 1 year ago

Full code 🔓 https://t.co/JUN6r1S293 Collaboration with @jinga_lala1 and @alsched. (6/6) #EfficientAI #EdgeAI #Quantization #TMLR #AI #GaTech #GeorgiaTech


alsched retweeted

about 1 year ago

Super excited to share another incredible system that we have built over the past two years! Training giant foundation models (like Llama-3 405B) costs a FORTUNE 💰 (millions of dollars)! Optimizing the training "recipe" (parallelism, memory tricks, etc.) is critical but incredibly complex. The wrong choices can waste millions. How do we find the best setup without burning GPUs? This is the problem we tackle with Maya, a GPU cluster emulation tool. Have a 1000-GPU job and want to know how it will perform? All you need is one CPU and the Maya virtual GPU runtime ✨ Arxiv: https://t.co/6GnkCTygpy Code: Coming soon! 🧵


alsched retweeted

about 1 year ago

Super long-context models with context windows spanning millions of tokens are becoming commonplace (@GoogleDeepMind Gemini, @xai Grok 3, @Alibaba_Qwen Qwen2.5). But efficiently serving these models is tough, especially alongside short requests. Head-of-Line (HOL) blocking becomes a major issue, hurting latency for everyone. We present Medha, a system designed to handle this mix efficiently. It achieves 30x lower latency and 5x higher throughput compared to the state of the art. Full paper: https://t.co/PQlwLtlnD5. 🧵



alsched retweeted

about 1 year ago

Maya offers a transparent, accurate, and efficient way to model and optimize large-scale DL training without needing expensive hardware clusters for exploration. A crucial step towards sustainable AI! Read the paper: https://t.co/6GnkCTygpy Work done with @Y_Srihas , @1ntEgr8 , Hakesh Darapaneni, Mitali Meratwal, Shivam Mittal, Pranavi Bajjuri, Srinivas Sridharan, @alsched at @GeorgiaTech @NVIDIA @NVIDIAAI


alsched retweeted

over 1 year ago

Sequence pipeline parallelism is being rapidly adopted in industry for extreme long-context inference! Check out our paper on system design for long-context inference for more details: https://t.co/kcHkvNWy6q https://t.co/3BWQOp8seA


alsched retweeted

ACM SoCC @ACMSoCC

over 1 year ago

At SoCC’24, Anastasia Ailamaki from EPFL will give a keynote on how disaggregated memory resources are becoming the norm and how this “new memory wall” affects database system design. This talk will be amazing, make sure to be there!! https://t.co/YfSh1K9vGM https://t.co/uLZTRj9cSm


over 1 year ago

Super-charged technical program this year at @ACMSoCC: https://t.co/LQddwoX5sV Looking forward! Hope to see you there! #socc24

ACM SoCC @ACMSoCC

over 1 year ago

We are just under a month away from SoCC’24! This year’s conference will be held Nov 20-22 at the Microsoft Campus in Redmond, WA. Early bird registration is now open until Nov 6. Make sure to register! https://t.co/aGYRvaDDtA


alsched retweeted

over 1 year ago

⚡ Speed Meets Accuracy: Unlike approximation-based methods, Mnemosyne achieves exact inference, ensuring that the generated output remains precise even when processing 10 million tokens, by effectively combining all these parallelization techniques to scale up to hundreds of GPUs https://t.co/BnwUs9szDF


over 1 year ago

@samiramanabi Amen to that


alsched retweeted

over 1 year ago

@Google has silently but surely developed an edge over @OpenAI. Long-context processing seems to be the key to Google's AI strategy. NotebookLM is a prime example of what long-context processing can unlock. In our latest paper, we talk about how systems can be built to support multi-million-token context lengths, matching Google's capabilities. In case you missed the paper, here is the NotebookLM-generated podcast! Podcast: https://t.co/pnQfxV914S Arxiv: https://t.co/i75NEz7h5F


alsched retweeted

over 1 year ago

🔗 Curious to learn more? Dive into our paper to explore the technical details behind Mnemosyne: https://t.co/2B125uxm6y. Joint work between @gtcomputing, @Microsoft and @UCSDJacobs with the amazing Esha Choukse, @alsched, @ramaramjee, @Junda_Chen_, Íñigo Goiri & Chaojie Zhang!


over 1 year ago

First publicly known support for LLM context of up to 10M tokens with high throughput & interactive production-grade TBT SLOs (30ms) with Mnemosyne. What would it take to pair program with GenAI on millions of LoC? Or analyze 10/110hrs of video/audio content? All precisely!
