@vr4300 Been slowly getting back into self driving things, figured it'd be fun to hack on the hw4 NPU. Still baby steps though. It would be super cool to run my own models on it
Meta has open sourced their CTran library that natively works with AMD & NVIDIA GPUs 🚀. Previously, if u want multiple NVIDIA GPUs to work together on an workload, you must used the NVIDIA NCCL library. Although NCCL's source code is public, it does not have an open governance model, does not have open CI, employs an "code dump" update model, is not GitHub first, and rarely accepts external contributions. Previously, If you want multiple GPUs to work together on an workload, you must used the AMD fork called RCCL library, which is a delayed fork of NVIDIA's NCCL. With CTran, it is 1 unified library and allows for adding new like Bruck's in an way such that the code can be shared between different AI GPU types.
Furthermore, Meta has open sourced NCCLX (NCCL extended) which is their production-tested collective library that powered all Llama training and uses the unified CTran library. Meta is the creator & main maintainer of PyTorch and is well trusted in the open source community.
NVIDIA continues to be the leader in collective libraries but Jensen must not taken it for granted given the heavily increased competition in the open source collective communication space. Just like how TRTLLM moved to an GitHub first development when facing heavy competition from SGLang/vLLM, Jensen should seriously consider moving NCCL to GitHub first open development model due to the competition in the collective front too. To draw parallel comparisons to the inference engine world, Collective Communication Libraries are moving from the 2021 "FasterTransformer" era to the 2025 "SGLang/vLLM/TRTLLM" era.
The main competitors in the collective library space include China's DeepEP library, AMD's new MORI, AMD's upcoming MORI-CCL, Meta's CTran & NCCLX, NVIDIA's NCCL (which has released their new NCCL Device API, NCCL's new GPU-Initiated Networking, etc). Competition breeds innovation! 🚀
torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence.
A Llama 3 model was trained across 300 L40S GPUs with synthetic failures every 15s. No restarts. No rollbacks. Just asynchronous recovery and continued progress.
📘 https://t.co/PeLDx14CRb
#PyTorch #DistributedTraining #FaultTolerance #OpenSourceAI
If you’re excited about optimizing code that runs equally well on a single or thousands of GPUs and if you have the ability to submit a single substantial PR to a major OSS library, we want you on the PyTorch team - especially if you’re early in your career.
If GPU optimization and systems problems excite you, why limit your impact to a single company or lab?
Working on PyTorch allows you to ship impact to the entire AI industry!
We're hiring across experiences -- junior and senior engineers.
Read more below 👇
@StasBekman That buffer is 4MB per *peer*. How many GPUs are you running with and what network topology and collectives are you using?
I'm trying to figure out if we can track this but AFAIK the PyTorch profiler tracks PyTorch allocations and we can't inspect into NCCL internal allocations
@alex_peys@main_horse@PyTorch I just double checked, the DDP wrapper and any dist.* operation throws an error if init_process_group hasn't been called
torchrun working with any script even if it's not distributed is actually pretty useful -- I've used it for doing bulk inference from time to time
@StasBekman This is with the NCCL backend? I expect this is due to NCCL's lazy init. NCCL doesn't create the pairs/buffers until the first call under normal usage.
Can you try calling init_process_group w/ device_id which should cause eager initialization of the memory
Meta has unveiled a resilient training solution for large models with PyTorch: https://t.co/cBQXo8mCuC 🚀
Even better, their detailed design doc is publicly available: https://t.co/J8ueZPC4MZ
The solution works generically at the replica level—silencing unhealthy replicas & reintegrating them when healthy. Simple, yes, means many elasticity strategies are not possible, but at the same time highly generalizable and reliable 💡 #AI #PyTorch
one more implementation of DiLoCo to do distributed training!
@PyTorch's TorchFT fault tolerance package has an implementation of DiLoCo
hopefully soon a Streaming DiLoCo too?
The @PyTorch team are working on a new super important tool:
https://t.co/rnfpDuvgOI
This repository implements techniques for doing a per-step fault tolerance so you can keep training if errors occur without interrupting the entire training job.
Some big companies already have this as a proprietary solution, so it's great this super important solution is worked on for the rest of us.
It's a prototype at the moment that already works for DDP
@timzaman@StasBekman@PyTorch Likewise! I finally get to build the full PyTorch fault tolerance system I've always wanted haha
Let me know if you'd like to chat -- I'm sure you have a lot of insights into fault tolerance at scale
@StasBekman@timzaman@PyTorch That's the plan for now -- this repo is a staging ground for these fault tolerance approaches. If things just make sense, we'll upstream it into standard PyTorch Distributed.
There's a number of changes we're making in PTD to make torchft (and FT in general) work better
@mattydtweetz@StasBekman@PyTorch The other consideration is if you have a differing number of workers some workers may finish their shard of the dataset before others. We've discussed writing a fault tolerant distributed dataloader that can automatically rebalance but still in the idea phase
@mattydtweetz@StasBekman@PyTorch That's the case with the stock PyTorch dataloader but not a hard requirement
On error the "should_commit" operation returns false and we discard the step -- with a custom dataloader you can detect this and reuse the batch instead
Got a sample of the Tesla Insurance telemetry data. The insurance records are on a per drive basis. Here's the fields:
* Unique Drive ID
* Record Version
* Car Firmware Version
* Driver Profile Name
* Start / End Time
* Drive Duration
* Start / End Odometer
(1/2)
@dav_ell Plenoxel paper is a good read https://t.co/itGsNQdTNo though it uses a sparse voxel octree which isn't feasible to output from a dense NN model. A voxel grid + SH in https://t.co/iBqIeDBc1d would be cool to explore
Curious what I've been up to in the past 6 months? 😅
I've been working on a novel approach to depth and occupancy understanding for my FSD models!
It's much simpler than existing techniques and directly learns the 3D representation ⬇️
@dav_ell Yup -- that's basically how NeRF works with a voxel representation. I've tried this a couple of times but my attempts haven't worked out super well -- might be good as an extra loss. Using spherical harmonics is a better option instead of my approach with one color per voxel