@OosGergVer @__tinygrad__ He should reconsider: selling chips with solid firmware and an OSS stack on top -> other people fixing the OSS stack -> a possible alternative to the CUDA moat.
CUDA is why NVIDIA is worth that much.
@__tinygrad__ These straightforward comparisons to LLMs are an insult to the brain. The brain runs far more efficient software. LLM vs brain is like Windows 11 vs TempleOS. I liked the estimate in your blog post "Brain FLOPS": 162 GFLOPS.
"deepmriprep: VBM preprocessing via deep neural networks" is published in Nature Computational Science 🧠💻
🔗 Paper: https://t.co/2ZQdiACU9z
VBM preprocessing in ~10 seconds per #brain image 🚀
🔗 GitHub: https://t.co/PIA6NeGr2r…
Install via "pip install deepmriprep"
This really speeds up preprocessing and shows - yet again - that neural networks are eating software.
Great work, @codingfisch !! Happy to see this finally published.
📢Out now! @codingfisch and colleagues present deepmriprep, a tool that leverages neural networks to enable 37x faster Voxel-based Morphometry preprocessing of MRI data than existing methods. https://t.co/WETn1yRrov
@__tinygrad__ @ID_AA_Carmack RISC for array ops sounds too elegant to be impossible (as George said @clattner_llvm told him). I hope you will find the right abstractions to crush this problem soon!
Our GPU stack for both NVIDIA and AMD, aside from minimal pieces of signed firmware, is 100% open source and pure Python except for the compiler. It's not using vendor drivers, frameworks, or libraries. That's why it's so easy to make it work on Mac.
For compilers, on AMD, we use upstream LLVM, and on NVIDIA, we use the NAK compiler from the Mesa project. We plan to replace the compiler with pure tinygrad in a year or two as well.
With RANGEIFY merged, our lowering stuff now matches the state of the art, TVM style. We're studying ThunderKittens and TileLang for speed at that level, and should have all this stuff ready in 200 days for the due date of our AMD Llama 405B training contract.
Due to tinygrad's small size and pure Python nature, it's the easiest ML library to make progress on, aka fastest slope of improvement. With Megakernel style for scheduling, MODeL_opt style for planning, and E-graph style for symbolic, we should blow past the state of the art in PyTorch and JAX speed.
If we do that, NVIDIA's moat is over. It's 1000 lines at most to add a new accelerator to tinygrad. And I don't mean to add a new accelerator with help from a kernel driver, compiler, and libraries. Just 1000 lines of software for the *whole* accelerator speaking right on the PCIe BARs, like what tinygrad is doing with the NVIDIA and AMD GPUs now.
Theorem: The maximum possible duration of the computational singularity is 470 years.
Proof: The FLOPs capacity of all computers which existed in the year 1986 is estimated to be at most 4.5e14 (Hilbert and López 2011). Based on public Nvidia revenue and GPU specs, this capacity has grown to at least 1e22 FLOPs as of 2025. This difference implies an average growth rate of roughly 55% per year since 1986. Now observe that the physical universe can support at most 10^104 FLOPs (Lloyd 2000). Therefore, even if we allow for the discovery of faster-than-light travel, the computational singularity (i.e., the historical period of elevated social and technological unpredictability driven by rapid growth in worldwide computational capacity) cannot persist for longer than (2025 - 1986) + (104 - 22)/log_10(1.55) ~= 470 years.
References:
S. Lloyd, “Ultimate physical limits to computation,” *arXiv preprint quant-ph/9908043*, 1999, doi:10.48550/arXiv.quant-ph/9908043.
M. Hilbert and P. López, “The world’s technological capacity to store, communicate, and compute information,” *Science*, vol. 332, no. 6025, pp. 60–65, Apr. 2011, doi:10.1126/science.1200970.
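The arithmetic in the proof can be checked with a few lines of Python. This is only a sketch that plugs in the figures quoted above (4.5e14, 1e22, 10^104, and the rounded 55% growth rate); the variable names are mine, not from the cited papers.

```python
import math

# Figures as quoted in the proof above (not independently verified here).
flops_1986 = 4.5e14        # estimated capacity in 1986
flops_2025 = 1e22          # estimated capacity in 2025
limit_exponent = 104       # quoted upper bound: 10^104
years_elapsed = 2025 - 1986

# Average yearly growth factor implied by the two capacity estimates.
growth = (flops_2025 / flops_1986) ** (1 / years_elapsed)

# Years left until the bound is reached, using the proof's rounded rate of 1.55.
remaining_years = (limit_exponent - math.log10(flops_2025)) / math.log10(1.55)
total_years = years_elapsed + remaining_years

print(f"implied growth factor per year: {growth:.3f}")
print(f"total duration: {total_years:.0f} years")
```

The implied growth factor comes out slightly below 1.55, so the 55% in the proof is a rounded value; with that rounded rate the total lands at about 470 years.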
Good summary, but "frontier LLM researchers...shifted a little too much into exploit mode" is an understatement. A large chunk of ALL AI researchers bet on scaling up LLMs to AGI. If this bet fails, we will have spent a lot of researcher FLOPs in a local optimum. New small-scale RL ideas are needed!
@jsuarez @ID_AA_Carmack @clashluke It would really help if you would quantify stuff like this rigorously: Muon vs Adam(W) across different envs (with hyperparameter sweeps if needed) with ~10 different random seeds. It would be interesting to see which changes in puffer bring the largest performance increase.
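One way to make a comparison like this quantitative is a bootstrap confidence interval on the difference in mean score across seeds. The sketch below is an assumption about how that could be done, not anything from puffer itself; `bootstrap_mean_diff` is a hypothetical helper, and the inputs are whatever per-seed final scores you collect for each optimizer in one env.

```python
import random
import statistics

def bootstrap_mean_diff(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Return (observed mean difference, CI low, CI high) for mean(a) - mean(b).

    scores_a / scores_b: one final score per random seed for each optimizer.
    """
    rng = random.Random(seed)
    observed = statistics.mean(scores_a) - statistics.mean(scores_b)
    diffs = []
    for _ in range(n_boot):
        # Resample seeds with replacement, independently for each optimizer.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    # Percentile interval over the bootstrap distribution.
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return observed, low, high

# Usage (hypothetical variable names, one list of per-seed scores per optimizer):
# diff, low, high = bootstrap_mean_diff(muon_scores, adamw_scores)
```

If the interval excludes zero in most envs, the optimizer difference is unlikely to be seed noise; with only ~10 seeds the interval will be wide, which is itself useful to know.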