We just posted our paper: “M^3 Scaling Law: Optimizing Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models.”
Joint work with @kanaheinousagi and @stillpedant.
In this thread, I’ll explain the main idea and key findings. (1/N)