Today we’re launching TorchPass — Workload Fault Tolerance.
GPU failures, network disruptions, driver bugs — every fault forces a full job restart. Hours of compute, gone.
TorchPass makes faults invisible to the workload. Training continues. No restarts. No lost progress.
The AI infrastructure utilization gap isn't a compute problem.
It's a networking problem.
And the fabric decisions that close it are being made right now.
Roy Chua, Suresh Vasudevan, and Balaji Prabhakar broke down what’s changing in AI infrastructure economics. 🧵
Where the panel landed:
Extreme co-design is the pragmatic answer today.
But software that can optimize for workload-aware network fabrics wins long term.
The teams that understand where that transition happens will make better fabric decisions now.
The choice isn't just about bandwidth. It's about what you're locking into: the congestion control model, the failure domain boundaries, the observability surface, the upgrade path when the next silicon generation ships. Co-designed platforms get you integration. Open alternatives get you optionality. Neither is free.
Most teams don't have a scale-up strategy. They have a vendor relationship and a ship date. At the capital intensity of today's AI infrastructure, that's an expensive way to make an architectural decision.
🔗 Live panel: NVLink, UALink, ESUN, and the fabric decisions that compound at scale
https://t.co/yf4Xgk5gli
NVLink, UALink, or ESUN — do you have a scale-up strategy, or are you buying what ships first?
The scale-up interconnect market is no longer NVIDIA's alone. UALink is gaining traction. ESUN is in the mix. Multi-vendor competition at the node level is real for the first time, and the procurement decision you make now is an architectural commitment that compounds across every training run, every inference workload, every dollar of GPU spend for the next several years.
400+ interruptions in 54 days. That was Meta's Llama 3 training run.
What's your number — and do you even know?
Meta published theirs. Most teams don't track it at that resolution, which means they're absorbing the same cost without the visibility to act on it. Every interruption is a rollback, a checkpoint reload, a synchronization restart — GPU-hours consumed by recovery, not by model-advancing work.
At 16,000+ GPUs, the math compounds fast. A cluster that fails every few hours doesn't just lose the time to recover. It loses the training progress since the last checkpoint, plus the overhead of getting all workers back to a synchronized state, plus the ripple effects on everything scheduled after it.
Meta had the engineering depth to instrument, publish, and learn from that number. Most organizations running serious training workloads are flying with less visibility than that — which means the cost is there, it's just not visible yet.
Knowing your interruption rate is step one. Having a fabric architecture that reduces it — and recovers faster when it happens — is the actual lever.
🔗 Live panel on the networking decisions behind AI cluster reliability: https://t.co/4WlgPJKcwv
#AIInfrastructure #GPUClusters #DistributedTraining #AIFabric #DataCenterNetworking
Your fabric is three networks pretending to be one. Are you designing for that — or hoping it holds?
Scale-up connects GPUs within a node. Scale-out connects nodes within a pod. Scale-across carries traffic between buildings. Each has different physics, different failure modes, and a different blast radius when something goes wrong.
Most clusters treat them as a unified fabric. They aren't.
The assumptions that hold at 256 GPUs quietly break at 4,096 — not because the hardware changed, but because the interactions between the three layers compound at scale. A congestion event in the scale-out fabric shows up as a training stall that looks like a compute problem. A thermal event in the scale-up layer propagates in ways the scale-across layer can't see. The failure modes multiply across the seams.
The teams getting this right are asking different questions before the cluster is built: What's our congestion control strategy at each layer? Where does the blast radius of a failure stop? How do we instrument across all three without three separate forensics teams?
NVLink, UALink, or ESUN isn't a procurement question. It's an architectural commitment with consequences that compound.
🔗Live panel on the fabric decisions reshaping AI infrastructure economics: https://t.co/4WlgPJKcwv
Watch Lerna review the full alphabet of failure — NCCL, RDMA, queue pairs, link flapping, GPU failures — with illustrations and live demos.
🔗https://t.co/kMTEp6Tp0a
#AIInfrastructure#DistributedTraining#FaultTolerance
N is for NCCL, which timed out at dawn. F is for Flapping, a whispering dread. Q is for Queue pair, precise and exact — one stalls on the fabric and all lose the track.
Edward Gorey wrote an alphabet of inevitable disasters. https://t.co/BXAeZpbDtP’s Solutions Engineer Lerna Ekmekcioglu thinks he'd have been right at home writing about AI training at scale. 🧵
What that actually costs: → Slow queue pairs degrade throughput by 50% → A NIC fails, the job dies — reviving the NIC doesn't revive the job → Work lost is entirely determined by your last checkpoint
𝟗𝟐% 𝐨𝐟 𝐭𝐡𝐞 𝐜𝐨𝐬𝐭 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐆𝐏𝐔 𝐜𝐥𝐨𝐮𝐝 𝐩𝐫𝐨𝐯𝐢𝐝𝐞𝐫𝐬 𝐡𝐚𝐬 𝐧𝐨𝐭𝐡𝐢𝐧𝐠 𝐭𝐨 𝐝𝐨 𝐰𝐢𝐭𝐡 𝐆𝐏𝐔 𝐩𝐫𝐢𝐜𝐢𝐧𝐠.
It’s goodput: how much of your cluster is doing useful work vs recovering from failures.
SemiAnalysis modeled a 5,184 GB300 NVL72 cluster:
• TorchPass: 6% goodput expense
• Checkpointless: 10.53%
• Checkpoint restart: 20.91%
At Llama 3 scale, checkpoint restart can leave ~80% of GPUs idle after repeated failures.
As @SemiAnalysis_ put it, TorchPass is “the closest we've seen to what Frontier Labs are using at this scale.”
Infrastructure efficiency is becoming the real AI scaling advantage.
#AIInfrastructure #GPUClusters #Goodput #FaultTolerance #semianalysislinuxwebinar
At 4,096 GPUs, your training job's MTBF is hours, not days.
Every interruption: (t_id + t_failover)·j_size + t_repair·b_radius, times $/GPU-hr.
At $4/GPU-hr, an hour of idle 4k GPUs is ~$16k. Per incident.🧵
Why it works at scale: soft failures dominate above 1k GPUs. ECCs accumulating, GPU off the bus, NVLink errors, link flaps, slow-degrading nodes.
Scheduler catches the signal before NCCL stalls. You migrate while the node can still hand off state cleanly.