AI inference, speculative decoding, open source. Built novel decoding algorithms – default in Hugging Face Transformers (155+ ⭐). Making AI faster + cheaper
@feulf@tanishqkumar07@tri_dao@avnermay Thanks for the ping, Fed! It’s definitely exploring a very similar space to our distributed speculative decoding paper from iclr25 (https://t.co/Hn1wRyD055). Great to see Tanishq, Tri, and Avner working on this too!
How do we build sparsity into JEPA representations by design, while preserving task-relevant information?
Introducing Rectified LpJEPA, a JEPA architecture that learns sparse, non-negative, informative representations through principled distributional regularization. 📐
📄 Paper: https://t.co/CZtKHVvKT4
💻 Code: https://t.co/xIJMYkK9fT
📝 Blog: https://t.co/OFCl3FAmik
(1/n)
Speculative decoding has shown a lot of promise, though broader adoption has taken time due to the complexity of building production-ready tooling and high-quality draft models.
We’re releasing SpecBundle, a collection of large-scale EAGLE-3 draft models trained with SpecForge v0.2. This release brings major system improvements, including refactored training pipelines, multi-backend support with SGLang and @huggingface , and better usability at scale.
We also built a performance dashboard to make real end-to-end speedups visible across models and settings. See the dashboard and blog in the thread 👇
Even w/o training, you can still use speculative decoding. No need to train a speculator per model.
Our spec decoding algos for heterogeneous vocabs (open-sourced in HF Transformers; not yet in vLLM) let any off-the-shelf model serve as the speculator. ♻️
That means day-0 support for new models, and spec decoding for anyone who can’t train.
Speculative decoding is a powerful way to improve inference performance, but in practice it has been hard to adopt.
Training a unique draft model per LLM is time-consuming, and production-ready training utilities that work cleanly with vLLM have been limited.
Speculators v0.3.0 closes this gap with end-to-end training support for Eagle3 draft models that run seamlessly with vLLM.
The release adds offline data generation using vLLM and training support for single and multi-layer draft models, across both MoE and non-MoE verifiers.
Here's a 🧵 on speculative decoding and how to get started today in @vllm_project (1/8):
inference is perhaps the most valuable emerging software category.
as models get smarter and more economically valuable, compute will increasingly be spent drawing samples from the models.
if you'd like to work on inference at openai, reach out — [email protected]. include a description of an exceptional team you've been a part of, and your contribution towards that team's goals. also indicate any experience in inference, large-scale system optimization, or other areas where you've built up domain expertise.
lots of exciting problems to work on, ranging from deeply understanding the model forward pass (including simulating/finding creative opportunities for optimization); to system-level efficiencies such as speculative decoding or kv offloading or workload-aware load balancing; to managing and making observable a massive fleet at scale.
.@NadavTimor and I are going to train a SO-ARM101 with @LeRobotHF at the @huggingface office next week. If you’re in NYC and have an ARM101 and want to join us let me know! BYO arm 🦾
NYC open-source AI infra contributors — we’ve launched a community research hub above Grand Central where GPUs go brrr 🔥🗽
A place to hack, benchmark, and collaborate — vLLM, SGLang, kernels, inference optimizations all welcome.
Open space. Open source. Weekends too.
Huge thanks to @Company for supporting this initiative 🙌
𝐋𝐢𝐦𝐢𝐭𝐞𝐝 𝐬𝐞𝐚𝐭𝐬. 𝐃𝐫𝐨𝐩 𝐲𝐨𝐮𝐫 𝐏𝐑𝐬 𝐢𝐧 𝐭𝐡𝐞 𝐜𝐨𝐦𝐦𝐞𝐧𝐭𝐬 𝐭𝐨 𝐣𝐨𝐢𝐧 𝐭𝐡𝐞 𝐧𝐞𝐱𝐭 𝐬𝐩𝐫𝐢𝐧𝐭!