@SemiAnalysis_ ymmv, but I've done 2K scale H200 and B200 runs with a 70B model, up to 3D parallelism, using regional torch.compile with no issues. torch.compile is not distributed-aware, so the better method imo is regional compile of the transformer blocks, not full-model compile:
https://t.co/rrFXMUprZs
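To make "regional compile" concrete, here is a minimal single-device sketch, not the code from the link above. `Block`, `TinyModel`, and `compile_blocks` are hypothetical names made up for illustration; the only real API relied on is `torch.compile` and its `backend` argument. The idea is to compile each transformer block on its own instead of wrapping the whole model in one `torch.compile` call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """Toy pre-norm transformer block (hypothetical stand-in for a real one)."""

    def __init__(self, dim: int, n_heads: int) -> None:
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def _heads(self, z: torch.Tensor) -> torch.Tensor:
        # (batch, seq, dim) -> (batch, heads, seq, head_dim)
        b, t, d = z.shape
        return z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        a = F.scaled_dot_product_attention(
            self._heads(q), self._heads(k), self._heads(v), is_causal=True
        )
        x = x + self.proj(a.transpose(1, 2).reshape(b, t, d))
        return x + self.mlp(self.norm2(x))


class TinyModel(nn.Module):
    """Stack of blocks; sizes are arbitrary toy values."""

    def __init__(self, dim: int = 32, n_heads: int = 4, n_layers: int = 2) -> None:
        super().__init__()
        self.layers = nn.ModuleList(Block(dim, n_heads) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


def compile_blocks(model: TinyModel, backend: str = "inductor") -> TinyModel:
    """Regional compile: wrap each block separately, leave the outer model eager."""
    for i, block in enumerate(model.layers):
        model.layers[i] = torch.compile(block, backend=backend)
    return model
```

Usage would be something like `model = compile_blocks(TinyModel())` before the training loop. Since every block has the same structure, the compiler can typically reuse work across layers instead of tracing one giant graph. How this composes with FSDP/TP/PP wrapping is left out here and is setup-dependent.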
Our TorchTitan Paper has been Accepted to ICLR 2025!
(https://t.co/bJBvPZmQx7)
From the paper chair:
"
I recommend Accept ...:
(a) This is a production-grade framework that covers a wide range of parallelism method and .... is likely to have significant impact
"
We've just released the first version of our Deep Learning Tuning Playbook! This is our attempt to distill our process for actually getting good results with deep learning. We emphasize hyperparameter tuning since it has been a large pain point. https://t.co/PjeJVWeOzS
Excellent article on vector processing on CPU and GPU. If you want to dive deeper into what's happening internally and the tradeoffs involved, this covers SIMD, SIMT, blocks, warps, and more: https://t.co/rG38IizSoT
@IOHK_Charles This is a badge of honor. Bitcoin went through the very same thing with Wikipedia in 2010 or so (for real!). Basically, imo this anoints ADA as the successor to BTC. That said, I'll go update the page tomorrow with some links/news to make sure it stays.