All of these need to be saved through async process and those checkpoint should also be sharded because such gigantic model, checkpoints will cost significant amount of storage.
Llama 3 450B was trained using 16384 GPUs. It took around 54 days to train the model. According to job logs, there were around 466 interruptions during that period. Its not just weights which need to be saved to resume training from a nearest checkpoint to failures.
@paraschopra@liquidai we should start a project for open source hardware similar to RISC-V and then companies can use our sovereign hardware as well. WDYT ?
@paraschopra I see there are lot of companies like @liquidai across US. We are not even doing that. Even if build great SLM and able to generate revenue around various usecases, company would have good enough funding to buy new hardware. Also I would say from India ... (1/n)
@itsreallyvivek I was tired after reading great blog of crystallisation of transformer. But I just opened this article to see why its going viral and your writing compelled me to finish it
If you want to learn evolution of building blocks of LLMs and keep up with delta changes since transformer was released, this is a good read
https://t.co/y99lgGvI7E
A French engineer who lives quietly in Paris has spent 30 years writing software that the entire internet now runs on without knowing his name.
He wrote the code that streams every YouTube video, every Netflix show, every TikTok clip. He wrote the code that runs the virtual servers underneath AWS, Google Cloud, and Microsoft Azure. He calculated more digits of pi than anyone in history. He has no Twitter. He has no marketing. He just keeps shipping.
His name is Fabrice Bellard.
Here is the story, because almost nobody outside the systems programming world knows what one man has built.
Fabrice was born in 1972 in Grenoble, France. He studied at École Polytechnique, the top French engineering school. He never went to Silicon Valley. He never built a startup empire. He just wrote code.
In 2000 he started a project called FFmpeg, an open-source multimedia framework for encoding, decoding, and streaming video. He was 28. The project did one thing nobody else had done well. It handled every video and audio format that existed, in one library, on every operating system. He led it himself for years.
Today FFmpeg is the invisible engine of the internet. YouTube uses it. Netflix uses it. VLC uses it. Chrome and Firefox use parts of it. Every Android phone, every iPhone, every smart TV, every video editing tool you have ever touched runs FFmpeg somewhere underneath. If you have watched a video on a screen in the last 20 years, Fabrice's code processed it.
He was not done.
In 2003 he started QEMU, a machine emulator and virtualizer. He wrote it solo until version 0.7.1 in 2005. QEMU lets you run any operating system on any other operating system. It became the foundation of modern virtualization. KVM, the Linux kernel hypervisor, runs on top of QEMU. Every major cloud provider, AWS, Google Cloud, Microsoft Azure, IBM Cloud, runs virtual machines on infrastructure built around it. The Quick Emulator is the most cited piece of cloud infrastructure code on Earth.
He kept going.
In 2001 he won the International Obfuscated C Code Contest with a small C compiler that grew into TCC, the Tiny C Compiler. TCC can compile and boot a Linux kernel from source in under 15 seconds. In 2004 he calculated the most digits of pi ever computed at the time, using a personal desktop computer and an algorithm he derived himself called Bellard's formula. In 2011 he wrote a complete PC emulator in pure JavaScript that runs Linux in your browser, a project called JSLinux that engineers still cannot believe is real.
In 2019 he released QuickJS, a small but complete JavaScript engine that fits where V8 cannot. In 2021 he released NNCP, a neural network based lossless data compressor that immediately took the lead on the Large Text Compression Benchmark.
Then he turned his attention to large language models. He built TextSynth Server, a web server with a REST API for running LLMs locally. He released ts_zip and ts_sms, compression utilities that use language models to compress text and short messages at ratios traditional algorithms cannot reach. He released TSAC, a very low bitrate audio compression system. In December 2025 he released Micro QuickJS, a new JavaScript engine for microcontrollers, separate from QuickJS, designed for environments with almost no memory.
Fabrice co-founded a telecom company called Amarisoft in 2012, where he serves as CTO. Amarisoft builds 4G and 5G base station software used by carriers and labs around the world. He has been running it for over a decade while continuing to ship personal projects from his own home page at bellard dot org
He has no Twitter. He has no Instagram. He gives almost no interviews. His personal website is a flat list of projects with no styling, no fonts, no marketing copy. Just titles and links.
A quiet French engineer who never moved to Silicon Valley wrote the code that quietly runs the internet.
He is still shipping.
the anthropic co-founder jack clark advice that stuck with me:
read the primary material. not the summary. not what the ai said about it. the actual thing.
form your own opinion first. then ask the model. never the other way around.
keep practices in your life where it’s just you against the world ~ a sport, an instrument, reading, building something with your hands. spaces where the algorithm can’t mediate what you learn about yourself.
and don’t defer to AI even when it’s usually right. especially then, actually. that’s precisely when the habit forms.
the people who won’t get eaten by this moment are the ones who stayed hard to replace. not because they avoided the tools but because they kept the parts of thinking that make the tools worth using.
New blog! Is frontier asynchronous RL solved?
The blog covers Async RL theory and infrastructure, surveying 8 open-weight frontier labs for the algorithmic techniques and systems fixes to handle train-inference mismatch. Also answered: why do current methods still fail at high policy lag? Which methods scale with horizon and compute?
new grads often ask me what they should be doing so they don't fall behind in the ai space. there's a lot, but its honestly super manageable. become intimate with model internals. proof based linear algebra. non-convex optimization. this is stuff you could've done in undergrad. it definitely takes some time and work, but its doable. have taste, have opinions. train a small model, then train a big one. vLLM internals, tensor parallelism. hand roll kernels. cluster orchestration. do you have opinions on synthetic data? why don't you? SFT, PPO, you should know this. learn Triton. everyone is reproducing papers now so you need to be doing more. do you know the semi supply chain? where are the bottlenecks? hardware, man, hardware. your little gpu rig erector set in your basement isnt gonna cut it. build a cluster, a big one. pretrain a 800B model. now postrain it. serve it to millions of people. you should be able to beat deepseek on some benchmarks now. its a lot to take in but it all snowballs. this what job security looks like from now on. do you want to work in tech or not
🏹5 Days of Trajectory.
Day 3 - An Open Source Training Stack for Continual Learning
Building the platform for continual learning requires both partnering with pioneering AI companies, as we showed on Day 2 with Harvey, and working toward frontier research, which we are highlighting today.
Continual learning means models that improve hourly from real production use. But with the size of frontier models, this becomes quite difficult. A Qwen-397b would need to spin up and tear down repeatedly across six GPU nodes, and that's valuable time gone.
Our contribution is Continual LoRA (C-LoRA): many lightweight adapters running at once on one shared base model. Our insight centers on where the parallelism lives: instead of splitting one giant job across nodes, we load-balance many small jobs over a single base.
The result: 2.81x experiment throughput over single-tenant training, with no regression on rewards.
We built this together, with @anyscalecompute, @NovaSkyAI, and generous support from @GoogleCloud and @GoogleStartups. We've open-sourced on SkyRL as one of the first multi-LoRA, RL training platforms, so that every team can get to continual learning faster.
We’re very excited to see what you build, please reach out!
I am leaving the Foundational Research team at DeepMind!
I just wanted to take the time to reflect on this truly amazing journey. It was such an intense and fulling ride that I will always cherish.
Two efforts shaped me in particular: the reasoning work and building the continual learning infrastructure for robotics. They taught me what it takes to turn ambitious ideas into real systems. Here are some of my biggest takeways:
1. Iteration speed, iteration speed, iteration speed: the teams that win arent neccessarily the smartest but the ones able to execute on a thousand ideas in the time their competitors excute on five. This became way more obvious when we were working on reasoning for humanoids where the iteration contains hardware in the loop. You have to really deeply think about what it takes to test your hypothesis and how to greatly simplify the iteration loop to move faster.
2. Building scalable infrastructure from day 1: Researchers sometimes think that moving fast means building unscalable infrastructure. My time at DeepMind taught me that there is always one more experiment that requires refactoring the entire repo, as those come up, we should figure out how to better build the stack from the ground up to support more and more wacky experiments.
3. Having fun is probably the most important thing at work: When you truly enjoy your colleagues’ company and you are motivated by the success of the larger team, the late nights become memorable, not exhausting. I never truly understood this until the 1am nights at work all huddled near one of our humanoids trying to figure out why its behaving this way.
I’m especially gratefJ to my mentors @sippeyxp , Jie Tan, @Kanishka_Rao, and @carolina_parada for constantly finding harder challenges for me and pushing me to grow.
Peter Pastor, @keerthanpg, and Stefani Karp thank you for the late-night hacking sessions and the PEAK dinners. Those are some of my most treasured memories!
@claudiofantacci, Alex Lee, @Sumeet_Robotics, and Ken Caluwaerts thank you for teaching me how to build scalable infrastructure, from building the new inference stack to scaling experiments.
@Stacormed, @xiao_ted, @ColinearDevin, and Giulia Vezzani I learned so much from you. Thank you for entertaining all my hypotheses (especially the weird ones) and helping me learn through them.
I can go on and on.. I just can’t thank each one of you enough. Truly thankful for the time we spent together!
Will share more soon 👀