What if we could teach LLMs to be algorithm inventors?
I trained an LLM to improve sorting algorithms through pure reinforcement learning - and it discovered optimizations giving 47.92x speedups over an optimized python based Timsort baseline! No cold-start data needed.
I used @huggingface grpo implementation and @Alibaba_Qwen 7b model.
The hardest part about finetuning LLMs is that people generally don't have high-quality labeled data. Today, @databricks introduced TAO, a new finetuning method that only needs inputs, no labels necessary. Best of all, it actually beats supervised finetuning on labeled data.
BASED ✌️ turns 1! One year since its launch at NeurIPS 2023 — and it's helped shape the new wave of efficient LMs.
⚡️ Fastest linear attention kernels
🧠 405B models trained on 16 GPUs
💥 Inspired Mamba-v2, RWKVs, MiniMax
Checkout our retrospective below!
Jointly announcing EAGLE-3 with SGLang: Setting a new record in LLM inference acceleration!
- 5x🚀than vanilla (on HF)
- 1.4x🚀than EAGLE-2 (on HF)
- A record of ~400 TPS on LLama 3.1 8B with a single H100 (on SGLang)
- 1.65x🚀in latency even for large bs=64 (on SGLang)
- A new scaling law: more training data, better speedup
- Apache 2.0
Paper: https://t.co/u6mQ6U9xTT
Code: https://t.co/Hnhnwb9iJ3
SGLang version: https://t.co/8tCSDjCktY
⚒️Takeaway: Introducing training-time test, a novel draft model training technique: we replace feature prediction with direct token prediction and shift from top-layer-only features to multi-layer feature fusion. This approach unlocks a new scaling law previously undiscovered in EAGLE and EAGLE-2.
🙏Acknowledge: We would like to thank the SGLang team (@zhyncs42@lm_zheng@ying11231@JamesLiuID, @ispobaoke, and others @lmsysorg) for their merge and careful evaluation of EAGLE-3 on SGLang.
🤝Want to collaborate? We're a small academic group with limited GPU resources. If you're interested in supporting our next version of EAGLE or would like us to train a preliminary version tailored to a specific model, please get in touch!
Joint work with Yuhui Li, Fangyun Wei, and Chao Zhang
LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench!
Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time.
More 🧵👇
Thrilled to see @istoica05 joining X and couldn't agree more with his insights on the importance of shared infrastructure. "Open source" encompasses more than just open weights—it includes open data, open artifacts, and open infrastructure!
Unbelievable results, feels like a dream—our R1 model is now #1 in the world (with style control)! 🌍🏆 Beyond words right now. 🤯 All I know is we keep pushing forward to make open-source AGI a reality for everyone. 🚀✨ #OpenSource#AI#AGI#DeepSeekR1
@nikitamounier@hongyangzh I’m not sure the latest status — last I checked there was an accuracy issue causing lower acceptance rate. Maybe @CodyHaoYu or @eqhylxx have more up to date info
Once the AI labs realize they need to make products for survival, they will immediately reformulate their strategy to competing with the most obvious working thing that is vaguely under the guise of the original mission.
You should presume you will be ruthlessly copied.
Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!
1/7 🧵 MoEs: A tale of expectation vs reality
Marketing: "Only compute the expert parameters you need!"
Reality: Batch 16 requests → ALL experts activate
At serving time (vLLM/TGI), arithmetic intensity:
AI ≈ (num_tokens * top_k) / total_experts
In simpler terms: Your decode arithmetic intensity scales inversely with expert count 🤔
#MoE #LLMs #ChatGPT #Claude #vllm #AI #ML
Pie: Pooling CPU Memory for LLM Inference
paper: https://t.co/HPsU3exTFJ
Pie is an LLM inference framework that tackles the memory challenges of large models by enabling efficient GPU-CPU memory swapping and adaptive expansion. It optimizes memory usage without increasing latency, achieving up to 1.9x higher throughput and 2x lower latency compared to alternatives like vLLM, while reducing GPU memory usage by up to 1.67x.
🍎 The core of Kinetix is our new 2D rigid body physics engine: Jax2D. This is a minimal rewrite of the classic Box2D engine made by @erin_catto. Jax2D allows us to run thousands of heterogeneous parallel environments on a single GPU (yes, you can vmap over different tasks!)
8/