Ferdinand Mom @FerdinandMom - Twitter Profile

Pinned Tweet

over 1 year ago

Interested in 4D parallelism but feeling overwhelmed by Megatron-LM codebase? We are currently cooking something with @Haojun_Zhao14 and @xariusrke 😉 In the meantime, here is a self-contained script that implements Pipeline Parallelism (AFAB + 1F1B) in 200 LOC 🧵👇 https://t.co/SCbKRknIOF

12

229

44

146

27K
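
For readers unfamiliar with the two schedules named in the pinned tweet, here is a minimal sketch of the order in which one pipeline stage runs forward (F) and backward (B) passes over its micro-batches. This is not the script linked in the tweet. It is a generic illustration of the standard non-interleaved schedules. The function names are hypothetical, and it omits all communication and tensors.

```python
def afab_schedule(num_microbatches):
    """All-Forward-All-Backward: run every forward pass, then every backward pass."""
    forwards = [("F", i) for i in range(num_microbatches)]
    backwards = [("B", i) for i in range(num_microbatches)]
    return forwards + backwards


def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Non-interleaved 1F1B order for one stage (stage 0 is the first stage)."""
    # Earlier stages run extra forwards first so that later stages have work.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [("F", i) for i in range(warmup)]
    next_fwd, next_bwd = warmup, 0
    # Steady state: alternate one forward with one backward.
    for _ in range(num_microbatches - warmup):
        ops.append(("F", next_fwd))
        next_fwd += 1
        ops.append(("B", next_bwd))
        next_bwd += 1
    # Cooldown: drain the backward passes that are still pending.
    ops += [("B", i) for i in range(next_bwd, num_microbatches)]
    return ops


def peak_in_flight(ops):
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    stages, microbatches = 4, 8
    for s in range(stages):
        ops = one_f_one_b_schedule(s, stages, microbatches)
        print(s, " ".join(f"{k}{i}" for k, i in ops))
    print("AFAB peak:", peak_in_flight(afab_schedule(microbatches)))
    print("1F1B peak on stage 0:",
          peak_in_flight(one_f_one_b_schedule(0, stages, microbatches)))
```

In this sketch AFAB keeps activations for every micro-batch alive before any backward pass starts, while 1F1B caps the number held by a stage at the pipeline depth. That cap on activation memory is the usual reason to prefer 1F1B.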

FerdinandMom retweeted

Loubna Ben Allal

@LoubnaBenAllal1

18 days ago

Introducing Carbon 🧬 a family of open generative DNA foundation models. Carbon-3B matches Evo2-7B while running 250x faster at inference. It can generate new DNA sequences and score the functional impact of mutations, zero-shot.

We borrowed a lot from how modern LLMs are trained, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe:

Tokenizer. Most genomic models tokenize at the nucleotide/character level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on DNA. We use deterministic 6-mer tokens (one token = 6 nucleotides): 6× shorter sequences and cheaper attention.

Training loss. With 6-mer tokens, cross-entropy scores a prediction that gets 5/6 nucleotides right the same as one that's completely wrong. This gets brittle late in training and produces loss spikes. We switch mid-training to a more flexible factorized loss (FNS).

Data. Genomes are mostly sparse, repetitive background. We curate down to a staged functional DNA + mRNA mixture, with every ratio chosen by ablation, like mixing a web corpus, but for biology.

We're releasing the models, training data, training code, evaluation suite, and a demo to play with.

More details in the technical report: https://t.co/RMzFmTAhhT

Demo to play with the model, with a biology primer for our ML friends ;) https://t.co/IcOQq7GKF4

16

359

82

230

40K
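
The "one token = 6 nucleotides" idea in the Carbon announcement above can be sketched in a few lines. This is not Carbon's tokenizer. The vocabulary layout, the decision to drop a trailing partial 6-mer, and the restriction to the four bases A, C, G, T are all assumptions made for illustration, and the helper names are hypothetical.

```python
from itertools import product

K = 6
ALPHABET = "ACGT"  # assumption: unambiguous bases only, no N or IUPAC codes

# Hypothetical vocabulary: one id per possible 6-mer, 4**6 = 4096 entries.
VOCAB = {"".join(kmer): idx
         for idx, kmer in enumerate(product(ALPHABET, repeat=K))}


def tokenize(seq):
    """Split seq into non-overlapping 6-mers and map each one to its id.

    Assumption: a trailing fragment shorter than 6 bases is dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % K
    return [VOCAB[seq[i:i + K]] for i in range(0, usable, K)]


def shared_nucleotides(kmer_a, kmer_b):
    """Count the positions at which two k-mers carry the same base."""
    return sum(a == b for a, b in zip(kmer_a, kmer_b))


if __name__ == "__main__":
    dna = "ACGTACGGTTAACCGGTTAACCGG"  # 24 bases
    ids = tokenize(dna)
    print(len(dna), "bases ->", len(ids), "tokens")
    # Two 6-mers that agree on 5 of 6 bases are still distinct classes,
    # so a per-token cross-entropy gives no partial credit for the near miss.
    print(shared_nucleotides("ACGTAC", "ACGTAG"),
          VOCAB["ACGTAC"] == VOCAB["ACGTAG"])
```

The last two lines illustrate the training-loss remark in the tweet: at the token level a near-miss 6-mer is simply a different class.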

FerdinandMom retweeted

Rémi Ouazan

@remi_or_

23 days ago

Anyone interested in a CUDA deep dive that makes your workload 25% faster? 🧐 Just published a new blog post on asynchronous CPU / GPU inference: 100% insight, zero slop 😊 To learn how to remove all CPU overhead and use your GPU to the max, just read it 🔥 https://t.co/g1l4But2x2

1

25

11

14

4K

FerdinandMom retweeted

Arthur Douillard

@Ar_Douillard

about 1 month ago

The DiLoCo team at Google DeepMind and Google Research is proud to release Decoupled DiLoCo, the next frontier for resilient AI pre-training. Decoupled DiLoCo enables training with datacenters across the world, using heterogeneous hardware, and never halting the system despite hardware failures.

33

608

85

302

3M

FerdinandMom retweeted

Aksel

@akseljoonas

about 2 months ago

Introducing ml-intern, the agent that just automated the post-training team @huggingface

It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU sandboxes, iterates and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.

It can pull off crazy things:

We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper. Found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%.

In healthcare settings it inspected available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then upsampled 50x for training. Beat Codex on HealthBench by 60%.

For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on https://t.co/udm7xGpNzR, watched rewards claim and then collapse, and ran ablations until it succeeded.

All fully backed by papers, autonomously.

How it works? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and https://t.co/brvCC7fLPa, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on https://t.co/hrJuRkRyzi
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains

ml-intern deeply embodies how researchers work and think. It knows how data should look like and what good models feel like.

Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: https://t.co/l3K1PslZ1n
Web + mobile: https://t.co/orko5srL4H

And the best part? We also provisioned 1k$ GPU resources and Anthropic credits for the quickest among you to use.

136

5K

641

6K

1M

FerdinandMom retweeted

clem 🤗

@ClementDelangue

2 months ago

Next steps:
- enable the 50,000 models available in inference providers
- enable the 3,000,000 models available on HF
- local free fast inference with llama.cpp
- train and bring your own model!

We don't want a world where you're forced to choose between two or three lookalike models with the same biases, limitations, forced to pay fortunes in tokens even for small tasks and send all your data to the cloud.

We want a world where you have real model choice, options and freedom for your agents. Cloud, local, small, big, specialized, general, English or French, fast or slow, from six months ago or from six seconds ago, from third party or your own! Let's go!

34

688

64

206

63K

FerdinandMom retweeted

Paul S. Conyngham

@paul_conyngham

2 months ago

https://t.co/bpa3HHt8Mg

245

5K

935

5K

3M

FerdinandMom retweeted

Arthur Douillard

@Ar_Douillard

2 months ago

Training distributed DiLoCo / SparseLoCo over eduroam wifi, awesome!

0

60

16

11

9K

FerdinandMom retweeted

Swarnim Jain

@swar_ja

2 months ago

I trained models across MacBooks using Apple's AirDrop protocol. grove is a distributed training library for Apple Silicon. Devices discover each other over AWDL, a direct radio link. If there's a shared WiFi network it upgrades to that for speed, otherwise everything goes over the direct link. No router, no cloud, no setup. grove start <script> -n 4 grove join

161

3K

261

2K

666K

FerdinandMom retweeted

Lewis Tunstall

@_lewtun

3 months ago

You can now pretrain LLMs entirely on the HF Hub 💥

Last week, @OpenAI launched a competition to see who can pretrain the best LLM in under 10 minutes. So over the weekend, I made a little demo to automate this end-to-end using the Hub as the infra layer:
- Jobs to scale compute
- Buckets to store all experiments
- Trackio to log all the metrics

The cool thing here is that everything is launched locally: no ssh shenanigans into a cluster or fighting with colleagues over storage and GPUs ⚔️

All that's left is coming up with new ideas, but luckily Codex can automate that part too 😁

Can I have a job now please @reach_vb 🙏?

14

247

40

190

77K

Ferdinand Mom

@FerdinandMom

3 months ago

@DistStateAndMe lol

0

1

0

62

FerdinandMom retweeted

Sam Dare

@DistStateAndMe

3 months ago

A small step for mankind, a massive leap for decentralised training... for agency. In the space of 9 months, @tplr_ai went from 1.2B -> 72B. It's never been easy, and has broken everyone on the team multiple times. But I speak for all of us when I say it is the most rewarding thing we have ever done. We have a fraction of the resources. We don't have the PhDs. But Bittensor shows you it doesn't matter. Innovation happens at the edge. We innovate through scarcity. The ones who rewrite the rules are never the ones with the most. They're the ones who refuse to accept the limits they were handed. Bittensor is prophecy. Subnets (@covenant_ai and others) are the tools through which that prophecy is manifested. Next stop: TRILLIONS.

39

262

35

38

70K

FerdinandMom retweeted

templar @tplr_ai

3 months ago

We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n

216

6K

916

4K

2M

FerdinandMom retweeted

Hugging Face

@huggingface

3 months ago

🪣 We just shipped Storage Buckets: S3-like mutable storage, cheaper & faster

Git falls short for everything on high-throughput side of AI (checkpoints, processed data, agent traces, logs etc)

Buckets fixes that: fast writes, overwrites, directory sync 💨

All powered by Xet dedup so successive checkpoints skip the bytes that already exist ➡️

19

392

68

157

69K

FerdinandMom retweeted

Ethan He

@EthanHe_42

3 months ago

My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there to seriously train mixture of experts at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. https://t.co/QA1KRGu2Nc

39

986

105

506

82K

FerdinandMom retweeted

Autism Capital 🧩

@AutismCapital

3 months ago

@AnthropicAI New Anthropic

59

6K

226

162

439K

FerdinandMom retweeted

P.M

@p_misirov

4 months ago

there is a game called "data center" on steam which lets you build and manage your own data center. this is lowkey genius, the best way to educate people on a new trait. hyperscalers should learn a thing or two from "edutainment".

434

36K

3K

19K

7M

FerdinandMom retweeted

Thalaiyasingam Ajanthan

@tha_ajanthan

4 months ago

Breaking the Synchronization Bottleneck in Distributed Training with AsyncMesh. Communication overhead in synchronous data and pipeline parallelism restricts distributed training of large language models to co-located clusters with high-bandwidth interconnects. Our recent work from @Pluralis introduces AsyncMesh, which enables fully asynchronous optimization across both parallelism axes. By eliminating blocking communication, this avoids idle time, improves throughput, and enables efficient utilization of heterogeneous hardware. Asynchrony, however, introduces optimization challenges due to staleness between PP stages and DP replicas. For PP, we use our prior Nesterov-style weight look-ahead method to compensate for stage-dependent gradient delay. For DP, we introduce asynchronous sparse averaging, communicating only a small subset of parameters, and correcting delay via an EMA-based staleness estimator. We observe that sparse averaging is inherently robust to weight inconsistencies (e.g., staleness and quantization noise), making it well-suited for asynchronous settings while also substantially reducing data transfer between replicas. Empirically, we observed no performance degradation compared to fully synchronous training across a range of LLM training configurations, while significantly reducing communication overhead. More broadly, AsyncMesh makes distributed training feasible beyond co-located, high-speed clusters, facilitating large-scale collaborative training over the internet. The attached video illustrates the key concepts of the method and the paper can be found here: https://t.co/1nSUS0y9ri.

4

45

9

16

9K

FerdinandMom retweeted

Alexander Long

@AlexanderLong

4 months ago

Fully Asynchronous Pipeline Parallel + Async SPARTA on the DP axis. Microbatches constantly move through the system, no pipeline bubble or pause while you do an expensive all-reduce. Straightforward to implement. Walltime goes down a lot.

2

18

3

7

3K

FerdinandMom retweeted

nic lane

@niclane7

4 months ago

The next session of the @Cambridge_Uni ML Systems Seminar Series (@CaMLSys Lab) is coming up on Monday, Feb 16 at 2:30pm. We are pleased to host @FerdinandMom from @huggingface presenting "Bringing distributed training natively to Transformers library". The seminar will take place in LT1 @Cambridge_CL. Ferdinand Mom is a Research Engineer at Hugging Face with a background in large-scale pretraining and efficient deep learning systems. He is a contributor to the Hugging Face Transformers library: https://t.co/6swjLSuEQ8 -- and co-author of the Ultra-Scale Playbook: https://t.co/zPlTx5D5tJ. Ferdinand is a leading voice and experimentalist in distributed and decentralized training, pushing the limits of scalable open-source AI. talks @ cam link: https://t.co/A4lzB3BkI3