PyTorch Day India 2026 was a builder-focused milestone for open source AI in Bengaluru. Hosted by the @PyTorch Foundation, @IBM, @nvidia, & @RedHat, it provided an unparalleled platform for technical talks and interactive discussions designed to foster knowledge exchange & collaboration with 460 in-person attendees. View the video highlights ▶️ https://t.co/LvJ73fJAdX #PyTorchDayIndia
Can’t believe I get to say this -- deeply honored to be named a 2026 Sloan Research Fellow: https://t.co/P6dQxR2oJQ
Early faculty life is… "hyper-intense": teaching, advising, hiring, papers, grants; and trying to build a lab culture you’ll still be proud of years later. There were many weeks where it felt like we were building the plane mid-flight, burning plenty of midnight oil along the way.
Over the past few years, I’ve been incredibly lucky to work with amazing students and collaborators on a chain of OSS projects: Vicuna → Chatbot Arena → vLLM → DistServe → LMGame → FastVideo; each one was then pushed much further by people far beyond our lab. This award feels less like a finish line and more like fuel for the lab, for our students, and for the next set of systems we haven’t built yet.
A core principle of ours is building "open-source research that ships."
At the same time, it’s hard not to feel a mix of excitement + uncertainty + anxiety about where CS is heading. Coding agents are improving so fast that I am feeling the AGI firsthand. Outside of my faculty admin work, I have gone back to builder mode -- only more productive than ever. I’ve watched friends and colleagues hit numbers that would’ve sounded like science fiction a year ago (e.g., 100+ commits/day).
So what does it mean to “do great computer science” when baseline productivity keeps jumping?
For me, it makes “research that ships” more important, and even raises the bar. The leverage shifts toward taste and problem selection, principled system design, and translating ideas into reliable artifacts. We're excited to keep proving that through real systems people can use!
Deeply grateful to:
- My students and collaborators — for the ideas, execution, and drive.
- @HDSIUCSD , Dean @GuptaUcsd, and my @UCSanDiego colleagues — for building an environment where ambitious work can happen.
- @nvidia and @mbzuai (and other compute sponsors) — for support that helped us move faster and turn ideas into real artifacts. Even as the interface changes, the need for efficient compute and solid infrastructure only grows.
Most of all: credit to the students at @haoailab. You’re the reason any of this is worth doing. Keep building and shipping!
🔥 New Blog: “Disaggregated Inference: 18 Months Later”
18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year:
💸 Serving cost ↓10–100x
🚀 Throughput ↑10x
⚡ Latency ↓5x
A big reason? Disaggregated Inference.
From DistServe, our early research system on prefill-decode disaggregation, to today’s production frameworks, disaggregation has become the backbone of modern LLM serving.
So what is disaggregated inference?
Why does the LLM inference community love it?
And how far have we come?
As the inventors of this technique, we take a look back, 18 months later, at how the idea reshaped the landscape and what comes next.
🔗 Read the full story: https://t.co/Kh7e6xq0Gx
🚨 Announcing the Antler India AI Residency — our boldest program yet for India’s most ambitious AI founders.
₹4 Cr in investment, $1M+ in AI perks, and fast-track decisions in 4 weeks.
To learn more and apply 👇 Last date: 13 Aug 2025
Announcing FastVideo V1, a unified framework for accelerating video generation.
FastVideo V1 offers:
- A simple, consistent Python API
- State-of-the-art model performance optimizations
- Optimized implementations of popular models
Blog: https://t.co/lUsBq3Z4gm
What if Studio Ghibli directed Lord of the Rings?
I spent $250 in Kling credits and 9 hours re-editing the Fellowship trailer to bring that vision to life—and I’ll show you exactly how I did it 👇🏼
You might have heard top reasoning models now match AIME gold medalists in 2025 🏅, but watch them crumble in box-pushing Sokoban (倉庫番) from the 80s! 🧩
Again, we put top reasoning models into the game. o3-mini (medium) took the crown, reaching level 4 before getting tangled up with just two boxes. 😵💫
Claude-3.7-thinking managed two levels, DeepSeek-R1 cleared one level, and Gemini-2.0-flash-thinking solved none.
Reasoning models often waste tokens self-doubting.
Dynasor saves you up to 81% of the tokens needed to arrive at the correct answer! 🧠✂️
- Probe the model halfway through its reasoning to measure its certainty
- Use that certainty to decide when to stop reasoning
- 100% Training-Free, Plug-and-play
🎮Demo: https://t.co/nDNILbJayQ
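The probe-and-stop idea in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Dynasor's actual implementation: `early_exit_answer`, `toy_probe`, and the `patience` parameter are hypothetical names, and "certainty" is approximated here as the probed answer staying unchanged for several consecutive probes.

```python
from typing import Callable, Optional, Tuple


def early_exit_answer(
    probe: Callable[[int], str],
    max_chunks: int,
    patience: int = 3,
) -> Tuple[Optional[str], int]:
    """Stop reasoning once the probed answer is stable for `patience` probes.

    probe(i) is a hypothetical interface: it returns the answer the model
    would give if forced to answer right after reasoning chunk i.
    Returns (answer, chunks_used).
    """
    last: Optional[str] = None
    streak = 0
    for i in range(1, max_chunks + 1):
        guess = probe(i)
        # Extend the streak if the answer repeats, otherwise restart it.
        streak = streak + 1 if guess == last else 1
        last = guess
        if streak >= patience:
            return guess, i  # certain enough: stop early
    return last, max_chunks  # never became certain: use the full budget


def toy_probe(chunk: int) -> str:
    # Stand-in for a real model: it wavers early, then settles on "42".
    answers = ["17", "42", "40", "42", "42", "42", "42", "42", "42", "42"]
    return answers[chunk - 1]


answer, used = early_exit_answer(toy_probe, max_chunks=10, patience=3)
saved = 1 - used / 10  # fraction of the reasoning budget not spent
print(answer, used, saved)
```

In this toy run the answer stabilizes at chunk 6, so 4 of the 10 chunks are skipped. A real system would need a probing prompt and a certainty signal suited to the model; both are left out of this sketch.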
We are excited to announce our lab's papers at #ICML2024! 🧠✨
Come and discuss our latest research from LLM evaluation to efficient LLM serving & inference! See you there!
1️⃣ Poster: MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
📍 Location & Time: poster session 1 — Hall C 4-9 #816, 11:30 AM on Tuesday July 23
📜 TL;DR: MuxServe boosts multiple LLM serving throughput by up to 1.8x through flexible spatial-temporal multiplexing.
2️⃣ Poster: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
📍 Location & Time: poster session 2 — Hall C 4-9 #411, 1:30 PM on Tuesday July 23
📜 TL;DR: An exact and parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores.
3️⃣ Poster: Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
📍 Location & Time: poster session 3 — Hall C 4-9 #709, 11:30 AM on Wednesday July 24
📜 TL;DR: Chatbot Arena is an open platform for evaluating LLMs based on human preferences through crowdsourced pairwise comparisons, and it’s becoming a widely cited leaderboard for its robust and credible evaluation methods.
4️⃣ Poster: CLLMs: Consistency Large Language Models
📍 Location & Time: poster session 4 — Hall C 4-9 #604, 1:30 PM on Wednesday July 24
📜 TL;DR: We introduce a new family of LLMs optimized for fast Jacobi decoding, achieving a 2.4x to 3.4x improvement in generation speed across multiple benchmarks without compromising quality.
5️⃣ Poster: Online Speculative Decoding
📍 Location & Time: poster session 5 — Hall C 4-9 #605, 11:30 AM on Thursday July 25
📜 TL;DR: OSD improves the efficiency of large language model inference by continuously updating the draft models with user query data, resulting in a significant reduction in latency and an increase in token acceptance rates.
6️⃣ Poster: InferCept: Efficient Intercept Support for Augmented Large Language Model Inference
📍 Location & Time: poster session 5 — Hall C 4-9 #709, 11:30 AM on Thursday July 25
📜 TL;DR: InferCept is the first inference framework for augmented LLMs, efficiently serving LLMs that can query tools, ML models, and virtual environments.
People often see LLMs as sequential decoders, but we show they can be easily adapted as fast parallel decoders!🔥🚀
Announcing consistency LLMs: teaching LLMs to predict the fixed point from any point on their Jacobi decoding trajectory
- LLMs can fast-forward token generation.
- 3.4x speedup, no extra cost, no draft model.
Details: https://t.co/oKE7pWf497
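For readers new to Jacobi decoding, here is a small sketch of the fixed-point view. It assumes a toy, deterministic `next_token` function as a hypothetical stand-in for a greedy LLM step; it illustrates only the decoding loop, not how consistency LLMs are trained.

```python
from typing import List, Sequence, Tuple


def next_token(prefix: Sequence[int]) -> int:
    # Hypothetical stand-in for a greedy LLM step: deterministic in the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 50


def greedy_decode(prompt: List[int], n: int) -> List[int]:
    # Ordinary sequential decoding: one token per step.
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(prompt + out))
    return out


def jacobi_decode(prompt: List[int], n: int) -> Tuple[List[int], int]:
    guess = [0] * n  # arbitrary initial guess for the whole block
    for step in range(1, n + 1):
        # Every position is recomputed from the previous iterate only; with a
        # real model this would be a single batched forward pass.
        new = [next_token(prompt + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point reached
            return new, step
        guess = new
    # After n updates the first n tokens are guaranteed to be correct.
    return guess, n


prompt = [3, 1, 4]
tokens, steps = jacobi_decode(prompt, 8)
print(tokens == greedy_decode(prompt, 8), steps)
```

The fixed point matches the greedy output exactly, which is why the method is lossless. With a toy function like this one, convergence typically takes close to `n` iterations; the speedup claimed in the post comes from training the model to reach the fixed point in far fewer steps.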
Still optimizing throughput for LLM Serving?
Think again: Goodput might be a better choice!
Splitting prefill and decode onto different GPUs yields
- up to 4.48x goodput
- up to 10.2x stricter latency criteria
Blog: https://t.co/pVNpYbR7Qq
Paper: https://t.co/n47rFkMZS0
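To make the throughput vs. goodput distinction concrete, here is a simplified sketch. It assumes latency targets on time to first token (TTFT) and time per output token (TPOT). The `Request` class, the helper names, and the SLO values are all hypothetical, and the linked paper's definition is more careful than this per-request count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    ttft_s: float  # time to first token (prefill latency), in seconds
    tpot_s: float  # average time per output token (decode latency), in seconds


def throughput(requests: List[Request], window_s: float) -> float:
    # Counts every completed request, however slow it felt to the user.
    return len(requests) / window_s


def goodput(
    requests: List[Request],
    window_s: float,
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> float:
    # Counts only requests that met both latency targets.
    ok = sum(
        1
        for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return ok / window_s


reqs = [
    Request(0.20, 0.04),  # meets both targets
    Request(0.90, 0.03),  # slow first token
    Request(0.30, 0.12),  # slow decoding
    Request(0.25, 0.05),  # meets both targets
]
tp = throughput(reqs, window_s=2.0)
gp = goodput(reqs, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.1)
print(tp, gp)
```

Here throughput is 2.0 requests/s but goodput is only 1.0, because half the requests missed a latency target. Prefill mostly drives TTFT and decode mostly drives TPOT, so placing them on separate GPUs lets each be tuned for its own target.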