- 230 training runs
- 1,623 GPU hours (67 B200 days)
- 76 TB of training data
- a 2x faster model
Every paper said it couldn't be done.
Quantization Aware Distillation made it possible.
there is no such thing as running out of compute. for the right price, someone will sell you compute. it’s an elastic resource, like goods in any other market. when RSI arrives, running that program will be so valuable that all clouds will mostly shut down their other services and sell their compute to the singularity