Will Fast Matrix Multiplication ever be practical?
Strassen’s 1986 discovery of fast matrix multiplication (FMM) – asserting that the product of two 𝑛×𝑛 matrices can be computed in sub-cubic time 𝑛^𝜔 ∼ 𝑛²·⁸⁷ �� 𝑛³ – had a profound impact on theoretical computer science and algorithm design.
Since then, mathematicians improved on Strassen’s algorithm, and some experts believe that, eventually, it will be shown that 𝜔 ≈ 2, which would mean that the time to compute AxB is essentially the time it takes to merely read the inputs: ~O(𝑛²) (!) Needless to say, such result would have a major impact on the AI compute age we’re entering…
Unfortunately, FMM algorithms only work for enormous matrices--on the order of the number of atoms in the universe (“galactic algorithms" [1])--and it is currently hard to imagine them being practical on any imaginable hardware. Besides their asymptotic runtime, a core practical issue with FMM algorithms is that they all inherently rely on recursive divide-and-conquer, which creates memory and IO-bottlenecks, and is numerically unstable; This is likely the reason why the largest hardware manufacturers in the world are not developing chips for FMM. Even Strassen’s original algorithm, which gives nontrivial FLOP speedup for relatively small matrices, struggles to beat the sheer parallelism of naiive MatMul on GPUs or TPUs.
Some interesting progress on practical FMM seems underway [2] and would be interesting to follow, but it remains to be seen whether divide-and-conquer can be implemented in both silicon and memory to deliver wall-clock speedups for realistic dimensions of matrices in LLMs.
What is means for @prlnet. That’s the reason we designed the Pearl proof-of-work protocol (cuPOW) with the underlying baseline being “naiive” matrix multiplication O(𝑛³), which is what NVIDIA, AMD, Cerebras and all other AI hardware accelerators implement today.
Nevertheless, it is important to stress that Pearl’s protocol doesn't rely on naiive MatMul remaining SoTA -- if FMM becomes practical some day, Pearl's protocol can easily adapt to the 𝑛^𝜔 baseline (since the next version of cuPoW will only verify the output AB).
In fact, one of the intriguing aspect of @prlnet is that it creates incentives (for both humans and machines) to develop faster MatMul algorithms and hardware (as had happened in Bitcoin with SHA256). Of course, without proper modification, such breakthrough would break the security assumption of Pearl-GEMM, so such algorithmic breakthrough would better be public.
FMM and FFT. Our recent paper [3] shows that it is possible to achieve fast matrix multiplication without using Strassen-like divide-and-conquer, using only the Fast Fourier Transform, which is omnipresent in countless industry-scale applications. This paper presents a simple algorithm running in 𝑂(𝑛²·⁸⁹) time, which only sums a few convolutions in 𝕫ₖᵐ, using FFT (see figure below for illustration of the algorithm).
Despite being highly parallel (no recursion), this FFT algorithm for MatMul remains asymptotic, as it still requires many parallel repetitions on submatrices in order to obtain noticeable speedup over naiive MatMul (𝑛³). Whether FFT can lead to subcubic time MatMul
for reasonably-sized matrices is a fascinating question!
I believe FFTs are the most promising tool in this direction...
[1] Lipton, Richard J., and Kenneth W. Regan. “David Johnson: Galactic Algorithms.” In People, Problems, and Proofs, 109–112. Springer, 2013. https://t.co/X6N6ViYYai.
[2] Karstadt, Elaye, and Oded Schwartz. “Matrix Multiplication, a Little Faster.” Journal of the ACM 67, no. 1 (2020): 1:1–1:31. https://t.co/VnfiWVLdKK.
[3] Uffenheimer, Yahel, and Omri Weinstein. “Improved Sparse Recovery for Approximate Matrix Multiplication.” arXiv:2602.04386, 2026..