@elonmusk Redundancy - Run multiple identical systems in parallel (e.g., three GPUs or nodes voting on results). If one fails, the others take over seamlessly.
Spares - Launch extra GPUs/nodes beyond immediate needs.
Software - Fault-tolerant distributed training/inference frameworks.