Tristan Rice

@rice_fry

Machine Learning + Distributed Systems + Hardware Hacking SWE @pytorch, tweets are personal opinions don't use Twitter much anymore

Seattle / Vancouver

Joined August 2013

181 Following

5.7K Followers

553 Posts

Tristan Rice @rice_fry

4 months ago

@vr4300 Been slowly getting back into self driving things, figured it'd be fun to hack on the hw4 NPU. Still baby steps though. It would be super cool to run my own models on it

Tristan Rice @rice_fry

4 months ago

Very late news but looks like TRIP v2 (Tesla HW4 NPU) has 48MB of SRAM, up from 32MB on HW3

rice_fry retweeted

SemiAnalysis

@SemiAnalysis_

8 months ago

Meta has open sourced their CTran library that natively works with AMD & NVIDIA GPUs 🚀. Previously, if u want multiple NVIDIA GPUs to work together on an workload, you must used the NVIDIA NCCL library. Although NCCL's source code is public, it does not have an open governance model, does not have open CI, employs an "code dump" update model, is not GitHub first, and rarely accepts external contributions. Previously, If you want multiple GPUs to work together on an workload, you must used the AMD fork called RCCL library, which is a delayed fork of NVIDIA's NCCL. With CTran, it is 1 unified library and allows for adding new like Bruck's in an way such that the code can be shared between different AI GPU types. Furthermore, Meta has open sourced NCCLX (NCCL extended) which is their production-tested collective library that powered all Llama training and uses the unified CTran library. Meta is the creator & main maintainer of PyTorch and is well trusted in the open source community. NVIDIA continues to be the leader in collective libraries but Jensen must not taken it for granted given the heavily increased competition in the open source collective communication space. Just like how TRTLLM moved to an GitHub first development when facing heavy competition from SGLang/vLLM, Jensen should seriously consider moving NCCL to GitHub first open development model due to the competition in the collective front too. To draw parallel comparisons to the inference engine world, Collective Communication Libraries are moving from the 2021 "FasterTransformer" era to the 2025 "SGLang/vLLM/TRTLLM" era. The main competitors in the collective library space include China's DeepEP library, AMD's new MORI, AMD's upcoming MORI-CCL, Meta's CTran & NCCLX, NVIDIA's NCCL (which has released their new NCCL Device API, NCCL's new GPU-Initiated Networking, etc). Competition breeds innovation! 🚀

SemiAnalysis_'s tweet photo. Meta has open sourced their CTran library that natively works with AMD & NVIDIA GPUs 🚀. Previously, if u want multiple NVIDIA GPUs to work together on an workload, you must used the NVIDIA NCCL library. Although NCCL's source code is public, it does not have an open governance model, does not have open CI, employs an "code dump" update model, is not GitHub first, and rarely accepts external contributions. Previously, If you want multiple GPUs to work together on an workload, you must used the AMD fork called RCCL library, which is a delayed fork of NVIDIA's NCCL. With CTran, it is 1 unified library and allows for adding new like Bruck's in an way such that the code can be shared between different AI GPU types.

Furthermore, Meta has open sourced NCCLX (NCCL extended) which is their production-tested collective library that powered all Llama training and uses the unified CTran library. Meta is the creator & main maintainer of PyTorch and is well trusted in the open source community.

NVIDIA continues to be the leader in collective libraries but Jensen must not taken it for granted given the heavily increased competition in the open source collective communication space. Just like how TRTLLM moved to an GitHub first development when facing heavy competition from SGLang/vLLM, Jensen should seriously consider moving NCCL to GitHub first open development model due to the competition in the collective front too. To draw parallel comparisons to the inference engine world, Collective Communication Libraries are moving from the 2021 "FasterTransformer" era to the 2025 "SGLang/vLLM/TRTLLM" era.

The main competitors in the collective library space include China's DeepEP library, AMD's new MORI, AMD's upcoming MORI-CCL, Meta's CTran & NCCLX, NVIDIA's NCCL (which has released their new NCCL Device API, NCCL's new GPU-Initiated Networking, etc). Competition breeds innovation! 🚀

333

159

30K

Tristan Rice @rice_fry

12 months ago

It's been a blast working on this! I'm very excited about the future of fault tolerant training

PyTorch

@PyTorch

12 months ago

torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence. A Llama 3 model was trained across 300 L40S GPUs with synthetic failures every 15s. No restarts. No rollbacks. Just asynchronous recovery and continued progress. 📘 https://t.co/PeLDx14CRb #PyTorch #DistributedTraining #FaultTolerance #OpenSourceAI

PyTorch's tweet photo. torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence.

A Llama 3 model was trained across 300 L40S GPUs with synthetic failures every 15s. No restarts. No rollbacks. Just asynchronous recovery and continued progress.

📘 https://t.co/PeLDx14CRb

#PyTorch #DistributedTraining #FaultTolerance #OpenSourceAI

390

163

77K

Who to follow

green

@greentheonly

I report what I see. If it's good, it's good; if it's bad, it's bad. Does not depend on me. Make them release more awesome stuff. Don't shoot the messenger.

Frontier Clusters at OpenAI. Formerly DeepMind, Tesla AI, X, NVIDIA.

rice_fry retweeted

Mark Saroufim

@marksaroufim

about 1 year ago

If you’re excited about optimizing code that runs equally well on a single or thousands of GPUs and if you have the ability to submit a single substantial PR to a major OSS library, we want you on the PyTorch team - especially if you’re early in your career.

280

145

61K

rice_fry retweeted

Soumith Chintala

@soumithchintala

about 1 year ago

If GPU optimization and systems problems excite you, why limit your impact to a single company or lab? Working on PyTorch allows you to ship impact to the entire AI industry! We're hiring across experiences -- junior and senior engineers. Read more below 👇

298

33K

Tristan Rice @rice_fry

about 1 year ago

@StasBekman You could pass `NCCL_DEBUG_SUBSYS=ALLOC` which may give some info on the memory allocations

157

Tristan Rice @rice_fry

about 1 year ago

@StasBekman That buffer is 4MB per *peer*. How many GPUs are you running with and what network topology and collectives are you using? I'm trying to figure out if we can track this but AFAIK the PyTorch profiler tracks PyTorch allocations and we can't inspect into NCCL internal allocations

179

Tristan Rice @rice_fry

about 1 year ago

@alex_peys @main_horse @PyTorch I just double checked, the DDP wrapper and any dist.* operation throws an error if init_process_group hasn't been called torchrun working with any script even if it's not distributed is actually pretty useful -- I've used it for doing bulk inference from time to time

rice_fry's tweet photo. @alex_peys @main_horse @PyTorch I just double checked, the DDP wrapper and any dist.* operation throws an error if init_process_group hasn't been called

torchrun working with any script even if it's not distributed is actually pretty useful -- I've used it for doing bulk inference from time to time https://t.co/huQreJEh39

Tristan Rice @rice_fry

about 1 year ago

@StasBekman You can also control the NCCL buffer size via env https://t.co/rm3zwswxrS

Tristan Rice @rice_fry

about 1 year ago

@StasBekman This is with the NCCL backend? I expect this is due to NCCL's lazy init. NCCL doesn't create the pairs/buffers until the first call under normal usage. Can you try calling init_process_group w/ device_id which should cause eager initialization of the memory

119

rice_fry retweeted

🇺🇦TheSlava

@b0noi

over 1 year ago

Meta has unveiled a resilient training solution for large models with PyTorch: https://t.co/cBQXo8mCuC 🚀 Even better, their detailed design doc is publicly available: https://t.co/J8ueZPC4MZ The solution works generically at the replica level—silencing unhealthy replicas & reintegrating them when healthy. Simple, yes, means many elasticity strategies are not possible, but at the same time highly generalizable and reliable 💡 #AI #PyTorch

945

rice_fry retweeted

Arthur Douillard

@Ar_Douillard

over 1 year ago

one more implementation of DiLoCo to do distributed training! @PyTorch's TorchFT fault tolerance package has an implementation of DiLoCo hopefully soon a Streaming DiLoCo too?

Ar_Douillard's tweet photo. one more implementation of DiLoCo to do distributed training!

@PyTorch's TorchFT fault tolerance package has an implementation of DiLoCo

hopefully soon a Streaming DiLoCo too? https://t.co/7vCujIkOJa

rice_fry retweeted

Stas Bekman

@StasBekman

over 1 year ago

The @PyTorch team are working on a new super important tool: https://t.co/rnfpDuvgOI This repository implements techniques for doing a per-step fault tolerance so you can keep training if errors occur without interrupting the entire training job. Some big companies already have this as a proprietary solution, so it's great this super important solution is worked on for the rest of us. It's a prototype at the moment that already works for DDP

835

128

405

46K

Tristan Rice @rice_fry

over 1 year ago

@timzaman @StasBekman @PyTorch Likewise! I finally get to build the full PyTorch fault tolerance system I've always wanted haha Let me know if you'd like to chat -- I'm sure you have a lot of insights into fault tolerance at scale

126

Tristan Rice @rice_fry

over 1 year ago

@StasBekman @timzaman @PyTorch That's the plan for now -- this repo is a staging ground for these fault tolerance approaches. If things just make sense, we'll upstream it into standard PyTorch Distributed. There's a number of changes we're making in PTD to make torchft (and FT in general) work better

181

Tristan Rice @rice_fry

over 1 year ago

@mattydtweetz @StasBekman @PyTorch The other consideration is if you have a differing number of workers some workers may finish their shard of the dataset before others. We've discussed writing a fault tolerant distributed dataloader that can automatically rebalance but still in the idea phase

Tristan Rice @rice_fry

over 1 year ago

@mattydtweetz @StasBekman @PyTorch That's the case with the stock PyTorch dataloader but not a hard requirement On error the "should_commit" operation returns false and we discard the step -- with a custom dataloader you can detect this and reuse the batch instead

116

Tristan Rice @rice_fry

over 2 years ago

@R_Z_S @alanlcit yeah -- there's a normalized hidden fcw

Tristan Rice @rice_fry

about 5 years ago

Got a sample of the Tesla Insurance telemetry data. The insurance records are on a per drive basis. Here's the fields: * Unique Drive ID * Record Version * Car Firmware Version * Driver Profile Name * Start / End Time * Drive Duration * Start / End Odometer (1/2)

551

Tristan Rice @rice_fry

over 3 years ago

@dav_ell Plenoxel paper is a good read https://t.co/itGsNQdTNo though it uses a sparse voxel octree which isn't feasible to output from a dense NN model. A voxel grid + SH in https://t.co/iBqIeDBc1d would be cool to explore

284

Tristan Rice @rice_fry

almost 4 years ago

Curious what I've been up to in the past 6 months? 😅 I've been working on a novel approach to depth and occupancy understanding for my FSD models! It's much simpler than existing techniques and directly learns the 3D representation ⬇️

rice_fry's tweet photo. Curious what I've been up to in the past 6 months? 😅

I've been working on a novel approach to depth and occupancy understanding for my FSD models!

It's much simpler than existing techniques and directly learns the 3D representation ⬇️ https://t.co/N5Y86GurGj

305

Tristan Rice @rice_fry

over 3 years ago

@dav_ell Yup -- that's basically how NeRF works with a voxel representation. I've tried this a couple of times but my attempts haven't worked out super well -- might be good as an extra loss. Using spherical harmonics is a better option instead of my approach with one color per voxel

rice_fry's tweet photo. @dav_ell Yup -- that's basically how NeRF works with a voxel representation. I've tried this a couple of times but my attempts haven't worked out super well -- might be good as an extra loss. Using spherical harmonics is a better option instead of my approach with one color per voxel https://t.co/Z4A33tdNgw

341

Tristan Rice

@rice_fry

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users