Many people think any given ML project is 99% training.
In reality, it’s 50% evaluation, 40% data cleaning, 8% integration, and 2% training.
The first two set the noise floor for learning. No ML magic changes that: the model cannot go below the noise floor, because it is the information-theoretic (Shannon) limit on how well your data can be encoded.
Thus, not a single day goes by without me thinking about ontology. Even the old labels have to be constantly reviewed.
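To make the noise-floor point concrete, here is a minimal sketch. It assumes a binary task whose recorded labels are flipped independently with probability `noise_rate`; that setup, and the function name, are illustrative assumptions and not something the post specifies. Under that assumption, even an oracle that always predicts the true label scores only about `1 - noise_rate` when evaluated against the noisy labels.

```python
import random


def measured_accuracy_of_perfect_model(n_examples, noise_rate, seed=0):
    """Accuracy of an oracle classifier when scored against noisy labels."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_examples):
        true_label = rng.randint(0, 1)
        # Annotation noise (assumed model): flip the recorded label
        # with probability noise_rate.
        flipped = rng.random() < noise_rate
        recorded_label = 1 - true_label if flipped else true_label
        prediction = true_label  # the oracle always predicts the true label
        correct += prediction == recorded_label
    return correct / n_examples


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.2):
        acc = measured_accuracy_of_perfect_model(100_000, rate)
        print(f"noise_rate={rate:.2f}  measured accuracy={acc:.3f}")
```

The measured ceiling moves only when the labels get cleaner, which is why reviewing old labels pays off more than another training trick.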
Every spring, I'm excited to read through the comprehensive ~100-page "State of Machine Learning Competitions" report, which offers many interesting insights into current trends, useful tools, and emerging methodologies in the field. Below are some key takeaways from the latest 2024 report (https://t.co/ERfoItrYFh):
1) Language & frameworks
- Python remains the dominant language, with 76 out of 79 winning solutions.
- PyTorch continues to be the deep learning framework of choice, with 53 out of 60 deep learning competition winners.
2) Hardware trends
- Over 80% of winning teams used NVIDIA GPUs (with A100s being the most popular).
- Interestingly, there's still no mention of AMD GPUs.
- I'm surprised no solution utilized more than an 8xH100 server, which suggests that multi-node setups are either underutilized or underreported.
3) Efficiency Techniques
- Techniques like LoRA are still popular choices for reducing training compute requirements, but many now opt for full finetuning for improved modeling performance.
- And 8-bit and 4-bit quantization remain the most popular approaches for lowering inference compute requirements.
4) LLM reasoning
- The integration of chain-of-thought reasoning and inference-time scaling has already made its way into competitions. But these approaches currently rely on simplistic majority voting rather than advanced verifier LLMs (I expect more sophisticated implementations soon).
5) Computer vision
- Interestingly, most winning solutions in computer vision competitions are CNN- rather than transformer-based.
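For readers unfamiliar with the majority voting mentioned under LLM reasoning, here is a minimal sketch. It assumes you have already sampled several chain-of-thought completions and extracted one final answer string from each; the `majority_vote` helper and the sample answers are hypothetical and not taken from the report.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer among sampled completions."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize lightly so "42" and "42 " count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical final answers extracted from five sampled completions.
    sampled = ["42", "41", "42 ", "42", "17"]
    print(majority_vote(sampled))
```

A verifier-based approach would replace the raw counts with scores from a second model, so the answer with the highest total score wins instead of the most frequent one.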
Bonus: In one of the chapters of my LLM book, I described training a decoder-style LLM (GPT) for classification, which is a concept that surprised many readers. Interestingly, the report mentioned that many NLP competitions used decoder-style LLMs for classification tasks as well:
> "[...] several competitions seemed designed specifically with these powerful new decoder LLMs in mind. [...] The most commonly-used decoder models among competition winners in 2024 were variants of Llama, Mistral, Gemma, Qwen, and DeepSeek models. Several competition winners used only decoder models."
However, I recently saw the release of ModernBERT by Jeremy Howard's team, and I recommend at least trying this new encoder-style model before jumping to (often larger) decoder-style LLMs.
Just got back from an amazing trip to South Korea! The beaches in Busan were stunning; we spent the evenings by the sea, enjoying the breeze, listening to music, and watching fireworks in the distance. The sound of the waves was truly beautiful.
We just released a new climate emulator to explore the application of Stratospheric Aerosol Injection (SAI) to mitigate global warming!
SAI uses reflective particles in the atmosphere to reflect sunlight and thereby cool Earth’s surface. Our emulator lets you explore how different ways to apply SAI might affect average global temperature.
Please check out the emulator at https://t.co/OxtaQMyDuL.
SAI is a promising direction, but we still need more research to better understand its impact and potential implementation.
Big thanks to collaborators @jeremy_irvin16 @DanVisioni Ben Kravitz @dakotagruener @chrisroadmap and @DWatsonParris
Nice read on the rarely-discussed-in-the-open difficulties of training LLMs. Mature companies have dedicated teams maintaining the clusters. At scale, clusters leave the realm of engineering and become a lot more biological, hence e.g. teams dedicated to "hardware health".
A frustrating part of daily life when training large models is having to "babysit" the training run. You're there carefully monitoring the vital signs of your run: loss spikes, numerical issues, throughput, gradient norms, policy entropy, etc. Every time the run degrades or flatlines (can happen often), you quickly look for the stack trace to see what's up. You have to do this fast or 10,000 GPUs could be idling. Often, it is a new, exotic, scary-looking error you've never seen before, so you summon help to see if anyone can see what's up. The worst ones like to occur at 4am. Often no one can, so you just ban some nodes that look a bit sketchy and try to restart the run. Sometimes the run goes down just because you have not earned the favors of your gods that day, so you put a while True: loop around your launch command. The underlying issues can be highly diverse, from some GPUs just getting a bit too hot and suddenly doing incorrect multiplication once in a while, to some router going down and decreasing the networked file system I/O, to someone in the datacenter physically disconnecting a wire as part of an un-communicated maintenance. Sometimes you'll never know.
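The "while True: loop around your launch command" can be sketched roughly as follows. This is a minimal illustration, not anyone's production setup: the `run_with_restarts` helper, the `train.py` script, and its `--resume latest` flag are all hypothetical placeholders, and a real cluster would also want node banning and alerting on top.

```python
import subprocess
import time


def run_with_restarts(launch, max_restarts=None, backoff_seconds=30.0,
                      sleep=time.sleep):
    """Call launch() until it returns exit code 0; return the restart count."""
    restarts = 0
    while True:
        exit_code = launch()
        if exit_code == 0:
            return restarts
        if max_restarts is not None and restarts >= max_restarts:
            raise RuntimeError(
                f"run still failing after {restarts} restarts "
                f"(exit code {exit_code})"
            )
        restarts += 1
        # Brief pause so a persistently failing launch does not spin in a hot loop.
        sleep(backoff_seconds)


def launch_training():
    # Hypothetical launch command; replace with your own launcher and arguments.
    cmd = ["python", "train.py", "--resume", "latest"]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # max_restarts=None reproduces the unbounded "while True" behaviour.
    run_with_restarts(launch_training)
```

Passing `launch` and `sleep` as arguments keeps the loop easy to test without touching a real cluster, and setting `max_restarts` gives you a way out when the failure is not transient.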
Another necessary related citation here is the famous OPT-175B logbook, and I hope more like it can see the light of day in the future (see chronicles/OPT175B_Logbook.pdf in the git repo).
https://t.co/6xOHVtj0Gf
TLDR: LLM training runs are a significant stress test of the overall fault tolerance of a large computing system that behaves like a biological entity. And when you're shopping around for your compute, think about a lot more than just FLOPs and $. Think about the whole service from hardware to software across storage, networking, and compute. And think about whether the team maintaining it looks like The Avengers and whether you could become best friends.
Regardless of the circumstances, take action without overthinking. Understand the essence of this statement, and prioritize action over words. Act, act, act!
In this way, I can get what I need in most cases. Now, when I work, I always have a screen with ChatGPT, and I use Genie and Copilot in VSCode for assistance. It's quite enjoyable, and my workflow has completely changed.
But GPT may have already read most of the things I search for and can provide me with a summary directly. I can then have it validate whether there are any missing details or correct any errors.