Heading to ICLR 🇧🇷 Apr 22–27.
Come to our oral on Fri, Apr 24 (10:30 AM–12:00 PM, Room 202 A/B) or find me at our poster (3:15 PM–5:45 PM, P3-#521).
We study why LR decay can hurt curriculum-based LLM pretraining, and how to fix it.
Happy to chat!
It was a great honor to collaborate with @BranSun10, @Dunk_KD1998, @Harry_Chen_, with advice from Professor Kaifeng Lyu @vfleaking, and under the support and leadership of Professor Wenguang Chen.
Thanks to all contributors who made this work possible!
Announcing PCMind-2.1-Kaiyuan-2B
A new frontier for fully open-source models.
Not just weights: the full pretraining pipeline & recipe.
Specs: 2B params, 2.2T tokens
Approach: data-centric pretraining
Status: SOTA among fully-open models
HF: https://t.co/G86k7ja08P
Infrastructure: Kaiyuan-Spark
Built on Spark & Chukonu (https://t.co/8ZSX4EJcti) for scale.
- Capabilities: Massive deduplication & mixing.
- Speed: Optimized C++ kernels.
- Reproducibility: Reconstruct our exact training set via config files.
Innovation 3: Quality Curriculum
Samples sorted by quality (ascending), then interleaved globally.
- Progressive Exposure: Model sees "textbook quality" data only when mature.
- Stable Mix: Domain ratios (Chinese/Code/Math) remain fixed while quality ramps up.
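A minimal sketch of how such an ordering could be built, not the released pipeline. It assumes "sorted, then interleaved globally" means sorting each domain by quality and merging on each sample's relative rank within its domain; the function name, the tuple layout, and the toy data are all hypothetical.

```python
def quality_curriculum(samples):
    """Order (domain, quality, sample_id) tuples into a quality curriculum.

    Assumption: each domain is sorted by ascending quality, then domains are
    merged on the relative rank of every sample inside its own domain.
    """
    by_domain = {}
    for sample in samples:
        by_domain.setdefault(sample[0], []).append(sample)

    keyed = []
    for domain, items in by_domain.items():
        items.sort(key=lambda s: s[1])  # ascending quality within the domain
        n = len(items)
        for rank, sample in enumerate(items):
            # Midpoint of the sample's rank slot, a position in (0, 1).
            keyed.append(((rank + 0.5) / n, domain, sample))

    # Merging on the relative position keeps every domain's share roughly
    # constant along the stream while quality rises in all domains together.
    keyed.sort(key=lambda k: (k[0], k[1]))
    return [sample for _, _, sample in keyed]


# Hypothetical toy data: four web samples and two code samples.
toy = [
    ("web", 0.9, "w4"), ("web", 0.1, "w1"), ("code", 0.8, "c2"),
    ("web", 0.5, "w3"), ("code", 0.2, "c1"), ("web", 0.3, "w2"),
]
ordered_ids = [s[2] for s in quality_curriculum(toy)]
```

In this toy run each half of the stream holds two web samples and one code sample, with the best sample of each domain landing in the second half.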
Innovation 2: Strategic Repetition
High-quality data is finite. We use a multi-phase approach to repeat the best data without overfitting.
Method: Retain top 50% → 30% → 10% in later phases.
Result: Top 10% samples seen 4x; low-quality samples seen only once.
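The arithmetic behind those exposure counts can be checked with a small sketch. It assumes an initial phase over the full dataset (the post lists only the later retention levels, but "seen 4x" implies four passes for the best samples); the function name and the toy scores are hypothetical.

```python
def exposure_counts(scores, retain_fractions=(1.0, 0.5, 0.3, 0.1)):
    """Count how often each sample is seen across phases.

    Assumption: phase 1 uses all data; each later phase keeps only the
    top fraction of samples by quality score.
    """
    n = len(scores)
    # Sample indices ordered from highest to lowest quality score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    counts = [0] * n
    for frac in retain_fractions:
        keep = round(frac * n)
        for i in order[:keep]:
            counts[i] += 1
    return counts


scores = list(range(20))  # hypothetical quality scores; higher is better
counts = exposure_counts(scores)
total_exposures = sum(counts)  # samples processed across all phases
```

Under these assumptions the top 10% of samples are seen four times, the bottom half once, and the total number of samples processed is 1.9x the dataset size.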
Challenge: Heterogeneity & Scarcity
Open datasets (DCLM, FineWeb) are great but vastly different. High-quality tokens are potent but rare.
How to compare/mix heterogeneous sources?
How to maximize efficiency with sparse "gold" data?
We focus on these questions and run data-centric training.
Come meet us at #ICLR2025!
We'll be presenting our Multi-Power Law, a new approach to predicting full pretraining loss curves across LR schedules, during the poster session:
Friday, April 25
3:00 PM – 5:30 PM CST
Hall 3 + Hall 2B, Poster #237
We look forward to your feedback!
How does pretraining loss evolve under different LR schedules?
Meet our Multi-Power Law: it predicts the full loss curve for various schedules!
Accurate enough to optimize LR schedules directly.
Result? A WSD-like schedule that outperforms the rest!
Accepted at #ICLR2025
Using predicted final loss as a surrogate objective, we induce an optimized schedule, matching WSD (Hu et al., 2024) in shape but achieving even lower loss!
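For readers unfamiliar with the WSD (warmup-stable-decay) shape mentioned here, a minimal sketch of such a schedule follows. This is a generic illustration, not the optimized schedule from the paper; the function name, all default values, and the linear decay are assumptions chosen for simplicity.

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_steps=100,
           decay_frac=0.1, final_lr=3e-5):
    """Learning rate at `step` for a warmup-stable-decay shaped schedule.

    All defaults are hypothetical. The decay here is linear; other decay
    shapes are possible and may behave differently.
    """
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate constant.
        return peak_lr
    # Decay: move linearly from peak_lr toward final_lr.
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


schedule = [wsd_lr(s, 1000) for s in range(1000)]
```

A function of this form can be plugged into most training loops by setting the optimizer's learning rate from `wsd_lr(step, total_steps)` at every step.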
Results at a glance:
Our law is fitted on the schedules in the first row, then accurately predicts loss curves for unseen schedules in the second row!