People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way.
We share our approach, early results, and a quick look at our model in action.
https://t.co/AFJZ5kH7Ku
We arXiv’ed this paper a few months back, and I still find myself thinking about this work a lot: CAD arguably is a direct continuation of our previous DistServe line of work.
Two afterthoughts:
1. For >4 years, training systems have been surprisingly stable... We've had Megatron/DeepSpeed (and now FSDP2) for ages, and in the “classic” pretrain regime (16-32K context, fairly uniform batches), it’s fair to feel like the remaining wins are incremental. If you counted papers in MLSys/OSDI, I believe the number of training papers has declined a lot recently.
But the workload quietly changed: As agents + post-training became the main compute eater, context lengths jumped from already long to “ridiculously long”: 32K → 128K → 256K (some even start to claim 1M), and suddenly the #1 problem isn’t just parallelisms/kernels, but imbalance / stragglers. When one part of the pipeline grows ~quadratically with sequence length while most others are closer to linear, any “colocate everything on the same GPUs” design becomes a straggler source.
2. This naturally leads to the second thought: disaggregation isn’t just for serving.
We’ve talked a lot about P/D disaggregation in serving (DistServe), and AFD-style ideas for MoE. Here we show the same principle applies to training: the core attention compute -- softmax(QKᵀ)V -- is (1) essentially stateless (no trainable params) and (2) surprisingly composable at token granularity with modern kernels (thanks to all the kernel developers behind FlashAttention and FlashInfer). That means you can treat attention less like “a layer you must shard carefully” and more like “a compute service you can schedule.”
So instead of falling into the usual CP/SP rabbit hole (“what’s the perfect sharding scheme to balance this?” as when we think about TP/EP), we decouple the quadratic component, push it onto a pool of attention servers, and then shard/rebatch attention tasks *however* is convenient to equalize compute, even non-uniformly, without losing kernel efficiency. Training is throughput-sensitive (NOT latency-sensitive), so we can be aggressive with pipelining/overlap (ping-pong execution, comm/compute overlap) to hide all these overheads.
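To make the imbalance argument concrete, here is a toy sketch in Python, not the paper's scheduler. It assumes attention cost is just the count of causal (query, key) pairs, ignores communication and kernel-efficiency effects, and uses a plain greedy longest-task-first heuristic as a stand-in for whatever policy a real system would use. The names `causal_attention_cost`, `make_tasks`, `balance`, and `imbalance`, and the 4096-token chunk size, are all hypothetical.

```python
import heapq

def causal_attention_cost(q_start, q_end):
    # Number of (query, key) pairs when query tokens [q_start, q_end) of one
    # document each attend to every token up to and including themselves.
    return (q_end * (q_end + 1) - q_start * (q_start + 1)) // 2

def make_tasks(doc_lengths, chunk):
    # Split every document into query chunks; each chunk is an independent
    # attention task described by (cost, doc_id, q_start, q_end).
    tasks = []
    for doc_id, n in enumerate(doc_lengths):
        for start in range(0, n, chunk):
            end = min(start + chunk, n)
            tasks.append((causal_attention_cost(start, end), doc_id, start, end))
    return tasks

def balance(tasks, n_servers):
    # Greedy heuristic (an assumption, not the paper's policy): hand the
    # largest remaining task to the currently least-loaded server.
    heap = [(0, s) for s in range(n_servers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_servers)]
    for task in sorted(tasks, reverse=True):
        load, s = heapq.heappop(heap)
        assignment[s].append(task)
        heapq.heappush(heap, (load + task[0], s))
    return assignment, [sum(t[0] for t in a) for a in assignment]

def imbalance(loads):
    # Max load over mean load; 1.0 means perfectly balanced.
    return max(loads) / (sum(loads) / len(loads))

# Two GPUs with the same token count (128K each), so linear work is equal,
# but one holds a single long document and the other holds 32 short ones.
batches = [[131072], [4096] * 32]
colocated = [sum(causal_attention_cost(0, n) for n in b) for b in batches]

# Disaggregated: pool all documents, chunk them, rebalance across 2 servers.
all_docs = [n for b in batches for n in b]
tasks = make_tasks(all_docs, chunk=4096)
assignment, loads = balance(tasks, n_servers=2)

print("colocated imbalance:    ", round(imbalance(colocated), 3))
print("disaggregated imbalance:", round(imbalance(loads), 3))
```

In this toy setup the colocated layout leaves one GPU with almost all of the attention work even though token counts match, while chunking at token granularity lets the same total work be spread almost evenly.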
I hope this work provides some new perspectives about how people should think about CP/SP and disaggregation. 😀
🚀 Join us at the SF AIBrix & vLLM Meetup on June 18th at AWS SF GenAI Loft!
Learn from experts at ByteDance, AWS Neuron, and EKS. Discover AIBrix: a scalable, cost-effective control plane for vLLM.
Talks, Q&A, pizza, and networking! 🍕🤝
https://t.co/GZOmjemxJb
A great presence of #SysNet@IllinoisCS at OSDI/ATC last week. 4 papers were presented by @IllinoisCS students and one received a Jay Lepreau Best Paper Award.
It's great to see alumni like @happyandslow, Cong and Yifan who continue to engage with OSDI/ATC after they graduate.
For the technical details on how cloud computing could make cars safer, check out our paper:
https://t.co/I3OrUvFweH
Congratulations @pschafhalter, @sukritkalra, and @happyandslow!
Can GPUs in the ☁️ really drive your 🚗 and make it safer?
We have been studying this question and @pschafhalter will present our findings this afternoon @ieeeiros 2023.
Spoiler alert: Yes!
https://t.co/DPUPipst6r
A lot of people get confused about why I'd work on something and be skeptical about it at the same time, but I don't understand why anyone would not be skeptical of what they work on. It seems like a scientific obligation to be skeptical.
Meanwhile, having parents who work in completely different areas and don't have PhDs gives you a completely different mindset when you approach your problems (and career).