Introducing JetSpec: we find that speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting.
JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, this translates to around 1000 TPS on a single B200. ⚡️
Check out our project page for demos and a blog post on how we built it 👇
https://t.co/M4T8jOBWQ8
https://t.co/h9uipDbTuh
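For readers new to the idea, here is a minimal sketch of why speculative decoding can be lossless under greedy decoding. It is a generic chain-style draft-and-verify loop, not JetSpec's causal parallel tree drafting. `target_model` and `draft_model` are hypothetical toy stand-ins invented for this example, not real models or JetSpec components.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # maps a prefix to its greedy next token


def target_model(prefix: List[Token]) -> Token:
    # Hypothetical stand-in for the large model: a deterministic function of the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_model(prefix: List[Token]) -> Token:
    # Hypothetical cheap drafter: agrees with the target except when len(prefix) % 4 == 0.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 7


def greedy_decode(model: Model, prompt: List[Token], n: int) -> List[Token]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[Token], n: int, k: int = 4
) -> Tuple[List[Token], int]:
    out = list(prompt)
    goal = len(prompt) + n
    verify_steps = 0
    while len(out) < goal:
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Verify. A real system would score all k positions in one batched
        #    target forward pass; this toy calls the target once per position.
        verify_steps += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all drafts accepted: bonus token
        out.extend(accepted)
    return out[:goal], verify_steps
```

Every emitted token is the target's own greedy choice for its prefix, so the output matches plain greedy decoding. The saving comes from needing fewer verification rounds than tokens, so it depends on how often the drafter is right and how cheap drafting is.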
Digital agent learning needs massive rollouts. But digital agent rollouts are painfully slow due to heavy environments. 🐌
🚀 We introduce NanoRollout, a lightweight open infra (900 lines of core code) for digital agent rollout at scale, validated with three workloads:
🏋️ Large batch size (4K) SWE Agent RL -> surpasses DeepSWE-32B
🧪 250k+ distilled coding trajectories -> SOTA ≤32B open coding agent
⚡ Fast evaluation on coding/cua/unified agent -> finish
Check out our blog: https://t.co/IBNqqbLqra
🎥 Video DiTs are painfully slow: HunyuanVideo takes 16 min to generate a 5s 720P video on an H100. 🤯
Announcing Sliding Tile Attention (STA):
* Accelerate 3D full attention (FA3) by up to 10x
* Slash the end-to-end time from 16 to 5 mins
* NO extra training. NO quality loss! 🚀
Can you tell which videos are generated by the original HunyuanVideo, and which by STA? 👀
Blog: https://t.co/5kwzENjHjk
🎥 Frustrated by Sora's credit limits? Still waiting for Veo 2?
🚀 Open-source video DiTs are actually on par. We introduce FastVideo, an open-source stack for fast video generation with SoTA open models. It already supports Mochi and Hunyuan, with 8x faster inference: a 720P 5-second video in 62 seconds.
We are excited to share works from our amazing lab members and collaborators at #NeurIPS2024! 💡✨
Come and discuss our latest research on LLM serving scheduling, training and inference with emerging architectures, and more!
1️⃣ Poster: Efficient LLM Scheduling by Learning to Rank
📍 Location & Time: Fri 11am@East Exhibit Hall A-C #2608
🧑🎓 Leads: @FuYichao123
📜 TL;DR: LLM-LTR is an efficient LLM serving system that reduces latency by approximating Shortest Job First (SJF) scheduling through learning-to-rank techniques.
2️⃣ Poster: Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
📍 Location & Time: Wed 4:30pm@East Exhibit Hall A-C #2002
🧑🎓 Leads: @MaxMa1987, @_xiaomengy_, @violet_zct
📜 TL;DR: Megalodon is a pre-trained model that employs a novel neural architecture with better long-sequence modeling capability and inference-time efficiency.
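To make the LLM-LTR TL;DR above concrete, here is a toy sketch of why approximating Shortest Job First with a ranking helps average latency. It assumes a single server that runs one request at a time, which is much simpler than a real serving system with batching. The `scores` are hypothetical predictor outputs invented for this example (lower means "expected to finish sooner"), not results from the paper.

```python
from typing import List, Sequence


def mean_latency(lengths_in_service_order: Sequence[int]) -> float:
    """Average completion time when requests run one after another."""
    clock = 0
    total = 0
    for length in lengths_in_service_order:
        clock += length   # this request finishes at the current clock
        total += clock
    return total / len(lengths_in_service_order)


def order_by_score(lengths: Sequence[int], scores: Sequence[float]) -> List[int]:
    """Serve requests in ascending order of a predicted rank score."""
    order = sorted(range(len(lengths)), key=lambda i: scores[i])
    return [lengths[i] for i in order]


# True output lengths (unknown to the scheduler in practice) in arrival order.
lengths = [900, 20, 300, 15, 600, 40]
# Hypothetical, slightly imperfect ranking scores: the 40-token and 20-token
# requests are ranked in the wrong relative order.
scores = [0.9, 0.3, 0.5, 0.1, 0.6, 0.2]

fcfs = mean_latency(lengths)                             # first come, first served
ranked = mean_latency(order_by_score(lengths, scores))   # approximate SJF
oracle = mean_latency(sorted(lengths))                   # exact SJF with true lengths
print(f"FCFS={fcfs:.1f}  ranked={ranked:.1f}  oracle SJF={oracle:.1f}")
```

The scheduler only needs the relative order of requests, not exact lengths, which is why a ranking objective is a natural fit. A small ranking mistake costs little here compared with the gap to first-come-first-served.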
We are excited to announce our lab's papers at #ICML2024! 🧠✨
Come and discuss our latest research from LLM evaluation to efficient LLM serving & inference! See you there!
1️⃣ Poster: MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
📍 Location & Time: poster session 1 — Hall C 4-9 #816, 11:30 AM on Tuesday July 23
📜 TL;DR: MuxServe boosts multiple LLM serving throughput by up to 1.8x through flexible spatial-temporal multiplexing.
2️⃣ Poster: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
📍 Location & Time: poster session 2 — Hall C 4-9 #411, 1:30 PM on Tuesday July 23
📜 TL;DR: An exact and parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores.
3️⃣ Poster: Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
📍 Location & Time: poster session 3 — Hall C 4-9 #709, 11:30 AM on Wednesday July 24
📜 TL;DR: Chatbot Arena is an open platform for evaluating LLMs based on human preferences through crowdsourced pairwise comparisons, and it’s becoming a widely cited leaderboard for its robust and credible evaluation methods.
4️⃣ Poster: CLLMs: Consistency Large Language Models
📍 Location & Time: poster session 4 — Hall C 4-9 #604, 1:30 PM on Wednesday July 24
📜 TL;DR: We introduce a new family of LLMs optimized for fast Jacobi decoding, achieving a 2.4x to 3.4x improvement in generation speed across multiple benchmarks without compromising quality.
5️⃣ Poster: Online Speculative Decoding
📍 Location & Time: poster session 5 — Hall C 4-9 #605, 11:30 AM on Thursday July 25
📜 TL;DR: OSD improves the efficiency of large language model inference by continuously updating the draft models with user query data, resulting in a significant reduction in latency and an increase in token acceptance rates.
6️⃣ Poster: InferCept: Efficient Intercept Support for Augmented Large Language Model Inference
📍 Location & Time: poster session 5 — Hall C 4-9 #709, 11:30 AM on Thursday July 25
📜 TL;DR: InferCept is the first inference framework for augmented LLMs, efficiently serving LLMs that can query tools, ML models, and virtual environments.
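The Jacobi decoding mentioned in the CLLMs TL;DR above can be sketched as a fixed-point iteration: guess all n tokens at once, refresh every position in parallel from the previous guess, and stop when nothing changes. This is a minimal illustration with a hypothetical toy `next_token` function standing in for a model's greedy step. It is not the CLLMs training recipe or the Lookahead Decoding algorithm.

```python
from typing import List, Sequence, Tuple


def next_token(prefix: Sequence[int]) -> int:
    # Hypothetical stand-in for one greedy step of a language model.
    return (sum(prefix) * 31 + len(prefix)) % 7


def greedy_decode(prompt: Sequence[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def jacobi_decode(prompt: Sequence[int], n: int, pad: int = 0) -> Tuple[List[int], int]:
    guess = [pad] * n  # arbitrary initial guess for all n positions
    iterations = 0
    while True:
        iterations += 1
        # Each position is recomputed from the PREVIOUS guess only, so all n
        # updates are independent and could run as one parallel forward pass.
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point reached
            return new, iterations
        guess = new


tokens, iterations = jacobi_decode([1, 2, 3], n=8)
print(tokens == greedy_decode([1, 2, 3], 8), iterations)
```

The fixed point equals the greedy output, and after t iterations at least the first t tokens are already final, so the loop stops within n + 1 iterations. A speedup only appears when several positions settle per iteration, which is presumably what "optimized for fast Jacobi decoding" refers to.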
#ICML2024 Join us for a 2-hour tutorial on Monday, July 22, focusing on advanced algorithms and systems for efficient LLM serving. The session will include our recent research on:
✨ Mirage: Auto-gen performant GPU kernels for LLMs
💸 SpotServe: Cost-effective LLMs on spot instances
🌳 SpecInfer: Tree-based speculative decoding techniques
🔧 FlexLLM: Co-serving LLM inference & finetuning
Multiple LLM serving has emerged as a crucial and costly demand.
Want to co-serve multiple LLMs with better utilization?
Introducing MuxServe
- flexible spatial-temporal multiplexing
- up to 1.8x higher throughput
Blog: https://t.co/Pep94vUFTw
Paper: https://t.co/X1Jhov3QOY