Thank you for pointing out the inadequacy of current SOTA implementations of long context support.
Our team has demonstrated the superiority of the chunked prefill mechanism for enabling preemptive scheduling in support of long-context requests (10M tokens):
https://t.co/uJAnUx1iXO
Chunked pipeline parallelism is arguably the most general and scalable system technique for accelerating super-long-context inference.
It remains underrated today, largely because there still isn’t a strong, high-quality open-source implementation.
The SGLang team recently fully optimized it and published a detailed blog post explaining all the key details.
LLM inference is quickly becoming a global infrastructure problem. Leveraging AI agents to optimize these systems is the natural way to accelerate their development. But AI-driven optimization requires a fast, cheap, and accurate evaluation mechanism. 🧵
Academic credit for chunked prefill keeps getting missed by those reimplementing it in their own systems.
https://t.co/uJAnUx1iXO demonstrated the superiority of the chunked prefill mechanism for enabling preemptive scheduling of long-context requests (up to 10M tokens).
Chunk pipeline parallelism provides two critical advantages:
1. It scales ridiculously well. You can get 85%+ efficiency even at pp=32 on 128 H100s!
2. It supports preemption, so you avoid the terrible convoy effect that happens with long context. 10x+ improvement over CP 1/2
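A minimal sketch of why preemption at chunk boundaries helps with the convoy effect. This is a toy model, not the scheduler of any system mentioned here: it assumes a single worker, a long prefill that starts at time 0, a cost of one time unit per prefill token, and a short request that gets priority at the next chunk boundary. The function name and parameters are hypothetical.

```python
def short_request_finish_time(long_tokens, short_tokens, short_arrival, chunk_size=None):
    """Return the time at which a short request finishes on a single worker.

    Toy assumptions (not taken from the posts above):
      - a long prefill of `long_tokens` starts at time 0,
      - prefill costs one time unit per token,
      - with chunk_size=None the long prefill runs to completion (no preemption),
      - otherwise the short request preempts at the next chunk boundary.
    """
    if chunk_size is None:
        # Run-to-completion: the short request is stuck behind the whole prefill.
        boundary = long_tokens
    else:
        # Integer ceiling: first chunk boundary at or after the arrival time.
        chunks_done = -(-short_arrival // chunk_size)
        boundary = min(long_tokens, chunks_done * chunk_size)
    start = max(short_arrival, boundary)
    return start + short_tokens


# A 1M-token prefill is running; a 100-token request arrives at t=10.
arrival = 10
no_chunking = short_request_finish_time(1_000_000, 100, arrival) - arrival
chunked = short_request_finish_time(1_000_000, 100, arrival, chunk_size=4096) - arrival
print(no_chunking, chunked)  # 1000090 4186
```

In this toy setup the short request's latency is bounded by one chunk rather than by the full long prefill. Real systems also have to account for attention cost growing with context length, which this sketch ignores.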
After hitting evaluation puzzles like this in our own work, we analyzed patterns across LLM inference papers and identified 8 systematic evaluation issues that can make performance comparisons misleading. We have compiled a practical evaluation checklist to help avoid these pitfalls.
📄 https://t.co/KV6V24JmLR
We're also releasing Veeksha, our comprehensive LLM inference evaluation framework, later this month to help the community design more robust benchmarks! 🛠️
What evaluation issues have you discovered in your systems work? Let's learn from each other's mistakes!
@nitinkedi @jayashree2912 @kwatra @thisissouvikk @ramaramjee @alsched @gtcomputing @MSFTResearch @intel
Interesting work on long-context inference from @nvidia, where they scale KV parallelism on GB200 NVL72 systems! To learn more about accelerating long-context inference and the trade-offs between different parallelism dimensions, check out our paper, Medha: https://t.co/kcHkvNWy6q
Congratulations 👏 to our faculty who were recognized on the Spring 2025 CIOS Honor Roll for their outstanding teaching and educational impact:
Assoc. Prof. Alexey Tumanov and Asst. Prof. Jan Van Den Brand!
Super excited to share another incredible system that we have built over the past two years! Training giant foundation models (like Llama-3 405B) costs a FORTUNE 💰 (millions of dollars)! Optimizing the training "recipe" (parallelism, memory tricks, etc.) is critical but incredibly complex. The wrong choices can waste millions. How do we find the best setup without burning GPUs?
This is the problem we tackle with Maya, a GPU cluster emulation tool. Have a 1000-GPU job and want to know how it will perform? All you need is one CPU and the Maya virtual GPU runtime ✨
Arxiv: https://t.co/6GnkCTygpy
Code: Coming soon! 🧵
Super long-context models with context windows spanning millions of tokens are becoming commonplace (@GoogleDeepMind Gemini, @xai Grok 3, @Alibaba_Qwen Qwen2.5). But efficiently serving these models is tough, especially alongside short requests. Head-of-Line (HOL) blocking becomes a major issue, hurting latency for everyone.
We present Medha, a system designed to handle this mix efficiently, achieving 30x lower latency and 5x higher throughput compared to the state of the art. Full paper: https://t.co/PQlwLtlnD5. 🧵
Maya offers a transparent, accurate, and efficient way to model and optimize large-scale DL training without needing expensive hardware clusters for exploration. A crucial step towards sustainable AI!
Read the paper: https://t.co/6GnkCTygpy
Work done with @Y_Srihas, @1ntEgr8, Hakesh Darapaneni, Mitali Meratwal, Shivam Mittal, Pranavi Bajjuri, Srinivas Sridharan, @alsched at @GeorgiaTech @NVIDIA @NVIDIAAI
Sequence pipeline parallelism is being rapidly adopted for extreme long-context inference in the industry! Check out our paper on system design for long-context inference for more details: https://t.co/kcHkvNWy6q
At SoCC’24, Anastasia Ailamaki from EPFL will give a keynote on how disaggregated memory resources are becoming the norm and how this “new memory wall” affects database system design. This talk will be amazing, make sure to be there!!
https://t.co/YfSh1K9vGM
We are just under a month away from SoCC’24! This year’s conference will be from Nov 20-22 at the Microsoft Campus in Redmond, WA. Early bird registration is now open until Nov 6. Make sure to register! https://t.co/aGYRvaDDtA
⚡ Speed Meets Accuracy:
Unlike approximation-based methods, Mnemosyne achieves exact inference, ensuring that the generated output remains precise even when processing 10 million tokens, by effectively combining all these parallelization techniques to scale up to hundreds of GPUs.
@Google has silently but surely developed an edge over @OpenAI. Long-context processing seems to be the key to Google's AI strategy. NotebookLM is a prime example of what long-context processing can unlock. In our latest paper, we talk about how systems can be built to support multi-million-token context lengths, matching Google's capabilities. In case you missed the paper, here is the NotebookLM-generated podcast!
Podcast: https://t.co/pnQfxV914S
Arxiv: https://t.co/i75NEz7h5F
🔗 Curious to learn more? Dive into our paper to explore the technical details behind Mnemosyne: https://t.co/2B125uxm6y. Joint work between @gtcomputing, @Microsoft and @UCSDJacobs with amazing Esha Choukse, @alsched, @ramaramjee, @Junda_Chen_, Íñigo Goiri & Chaojie Zhang!
First publicly known support for LLM context of up to 10M tokens with high throughput & interactive production-grade TBT SLOs (30ms) with Mnemosyne. What would it take to pair program with GenAI on millions of LoC? Or analyze 10/110hrs of video/audio content? All precisely!
@Sriraam_UTD love this, thank you for sharing! If only we could have a dataset that would capture "time to acceptance" (including infinity) for all the ML arxiv papers out there, doing some correlation analysis could be insightful.