LLM, Generative AI, ASIC/FPGA design, statistical inference, applied researcher, AI accelerators, AI in EDA | expert witness | Loves Israel and the USA!
Why I love me a little Gaussian.
A Gaussian has a kind of a miraculous quality to it: it breathes in and out as you adjust σ, it shifts its center of gravity with μ. It responds to parameters the way a living system responds to stimuli.
Like life itself, there's also something self-organizing about it. It emerges naturally from the Central Limit Theorem as the shape that sums of independent random processes converge toward, almost regardless of what their underlying distributions looked like (finite variance is the main requirement). It's less of a formula you impose and more of a form that nature arrives at on its own.
Plus, it so often makes our lives as engineers easier.
One of the most recent examples is Google's latest paper, “TurboQuant”, about near-lossless quantization of LLM KV caches without storing any zero-points or scaling factors in memory. I shed more light on how they achieved that in the post linked below.
https://t.co/6cYi1bgEqb
𝗧𝘂𝗿𝗯𝗼𝗤𝘂𝗮𝗻𝘁: 𝗪𝗵𝗮𝘁 𝗘𝘃𝗲𝗿𝘆𝗼𝗻𝗲 𝗜𝘀 𝗠𝗶𝘀𝘀𝗶𝗻𝗴
I've read a dozen posts about Google's TurboQuant this week. They all say "random rotation makes the distribution predictable, so you can drop the quantization overhead." That's the what. Nobody explained the how and the why, which is by far the most beautiful part.
I wrote a companion post about why the Gaussian is so central to this (linked below). Engineering deep-dive:
𝗧𝗵𝗲 𝗛𝗮𝗱𝗮𝗺𝗮𝗿𝗱 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺: 𝗗𝗼𝗶𝗻𝗴 𝗧𝘄𝗼 𝗧𝗵𝗶𝗻𝗴𝘀 𝗮𝘁 𝗢𝗻𝗰𝗲
The "random rotation" is a Randomized Hadamard Transform: y = H·D·x, where H is the Hadamard matrix (scaled by 1/√d so that it is orthogonal) and D is a diagonal matrix of random ±1 sign flips. Two things happen simultaneously:
① 𝗜𝗻𝗱𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝗰𝗲. H is orthogonal, D randomizes phase. Output coordinates become approximately i.i.d. Nobody is highlighting this.
② 𝗦𝗽𝗲𝗲𝗱. Full random rotation costs O(d²). Hadamard has butterfly structure (like FFT), so H·D·x costs O(d log d). Practical on every token at inference time.
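To make the O(d log d) claim concrete, here is a minimal NumPy sketch of the transform described above. It assumes the Sylvester (natural-order) Hadamard matrix and a power-of-two dimension; the function names `fwht` and `randomized_hadamard` are illustrative, not taken from the paper or from any library.

```python
import numpy as np

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = np.asarray(v, dtype=np.float64).copy()
    d = v.shape[0]
    assert d > 0 and d & (d - 1) == 0, "length must be a power of two"
    h = 1
    while h < d:
        # One butterfly stage: combine the two halves of every block of width 2*h.
        v = v.reshape(-1, 2, h)
        v = np.stack((v[:, 0, :] + v[:, 1, :], v[:, 0, :] - v[:, 1, :]), axis=1)
        v = v.reshape(d)
        h *= 2
    return v

def randomized_hadamard(x, signs):
    """y = (1/sqrt(d)) * H @ (signs * x), using log2(d) butterfly stages."""
    x = np.asarray(x, dtype=np.float64)
    return fwht(signs * x) / np.sqrt(x.shape[0])

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d)
signs = rng.choice([-1.0, 1.0], size=d)   # the diagonal of D
y = randomized_hadamard(x, signs)

# The transform is orthogonal, so the norm is preserved.
print(np.linalg.norm(x), np.linalg.norm(y))
```

Each of the log₂d stages touches every element once, which is where the d·log d cost comes from. A production kernel would fuse these stages on the GPU rather than reshape arrays.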
𝗧𝗵𝗲 𝗖𝗲𝗻𝘁𝗿𝗮𝗹 𝗟𝗶𝗺𝗶𝘁 𝗧𝗵𝗲𝗼𝗿𝗲𝗺: 𝗧𝗵𝗲 𝗘𝗻𝗴𝗶𝗻𝗲 𝗡𝗼𝗯𝗼𝗱𝘆 𝗠𝗲𝗻𝘁𝗶𝗼𝗻𝗲𝗱
Each output coordinate yᵢ = (1/√d) Σⱼ Hᵢⱼ·Dⱼ·xⱼ is a sum of d independent mean-zero random variables (x is fixed, randomness comes from D).
By CLT: yᵢ ≈ 𝒩(0, ‖x‖²/d)
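Not a hand-wave. Convergence to the Gaussian is guaranteed by the Lindeberg condition, which holds whenever energy is spread across coordinates (no single xⱼ dominates ‖x‖). For finite d it is still an approximation, one that tightens as d grows.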
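A quick empirical check of the CLT step, as a sketch under stated assumptions: x is held fixed, only the sign vector D is resampled, and a small dense Hadamard matrix stands in for the fast transform. The choice of d = 256, coordinate i = 3, and the helper names are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 256, 20000

# Dense Sylvester Hadamard matrix (fine for a small illustrative d).
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)

def coord_samples(x, i=3):
    """Samples of y_i = (1/sqrt(d)) * sum_j H[i, j] * D_j * x_j over random sign vectors D."""
    D = rng.choice([-1.0, 1.0], size=(trials, d))
    return (D * x) @ H[i] / np.sqrt(d)

def excess_kurtosis(s):
    """Zero for a Gaussian; negative for flatter or two-point distributions."""
    s = s - s.mean()
    return (s**4).mean() / (s**2).mean() ** 2 - 3.0

x_spread = rng.standard_normal(d)   # energy spread over all coordinates
x_spiky = np.zeros(d)
x_spiky[0] = 1.0                    # all energy in a single coordinate

y_spread = coord_samples(x_spread)
y_spiky = coord_samples(x_spiky)

print(y_spread.var(), np.dot(x_spread, x_spread) / d)  # sample variance vs ||x||^2 / d
print(excess_kurtosis(y_spread))   # close to 0: near-Gaussian
print(excess_kurtosis(y_spiky))    # close to -2: a two-point distribution, not Gaussian
```

The spiky case is the one where the Lindeberg condition fails: with a single nonzero entry, yᵢ can only take the values ±1/√d, so no amount of sign randomization makes it Gaussian.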
𝗚𝗮𝘂𝘀𝘀𝗶𝗮𝗻 → 𝗨𝗻𝗶𝗳𝗼𝗿𝗺 𝗔𝗻𝗴𝗹𝗲𝘀 → 𝗭𝗲𝗿𝗼 𝗢𝘃𝗲𝗿𝗵𝗲𝗮𝗱
PolarQuant pairs coordinates and converts to polar: (y₁,y₂) → (r, θ). For two i.i.d. zero-mean Gaussians, θ is distributed Uniform[0, 2π). Classical result from rotational symmetry.
Uniform = dream scenario for quantization. Equal-width bins are optimal. Boundaries are fixed: binᵢ = i·2π/2ᵏ. No per-block scale factors. No zero-points. Hardcoded for every vector, every layer, every token.
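The fixed-bin idea can be sketched in a few lines. This covers only the first-level pairing described above, uses synthetic i.i.d. Gaussian pairs as a stand-in for rotated coordinates, and reconstructs at bin centers (an assumption; the post only specifies the boundaries). The helper names are hypothetical.

```python
import numpy as np

def quantize_angle(theta, k):
    """Map angles (radians) to k-bit codes using fixed equal-width bins on [0, 2*pi)."""
    width = 2.0 * np.pi / (1 << k)
    codes = np.floor(np.mod(theta, 2.0 * np.pi) / width).astype(np.int64)
    return codes % (1 << k)   # guard against rounding up to exactly 2*pi

def dequantize_angle(code, k):
    """Reconstruct each angle as the center of its bin."""
    width = 2.0 * np.pi / (1 << k)
    return (code + 0.5) * width

rng = np.random.default_rng(0)
k = 3
y = rng.standard_normal((100_000, 2))   # stand-in for pairs of rotated coordinates
theta = np.arctan2(y[:, 1], y[:, 0])    # angle of each pair
codes = quantize_angle(theta, k)

counts = np.bincount(codes, minlength=1 << k)
print(counts / counts.sum())            # each bin holds roughly 1 / 2**k of the mass
```

Nothing in `quantize_angle` depends on the data: the bin boundaries are constants, which is the whole point.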
The radii recurse up a binary tree (log₂d levels) to one final radius = ‖x‖, stored once in FP16. Compare: traditional methods store 2 FP16 constants per block. At block size 32, that's 1.0 bit/element of overhead. When your budget is 2-3 bits, that's 33-50% wasted on bookkeeping. PolarQuant's overhead at d=128: 0.125 bits/element. At d=512: 0.03. It vanishes as d grows.
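The overhead figures above are simple arithmetic, reproduced here so they can be checked. The function names are illustrative, and the calculation assumes 16-bit constants throughout.

```python
def per_block_overhead_bits(block_size, constants_per_block=2, bits_per_constant=16):
    """Bits per element spent on per-block scale and zero-point constants."""
    return constants_per_block * bits_per_constant / block_size

def single_norm_overhead_bits(d, bits_per_norm=16):
    """Bits per element when only one FP16 norm is stored for a d-dimensional vector."""
    return bits_per_norm / d

block = per_block_overhead_bits(32)
print(block)                              # 1.0 bit/element
for budget in (2, 3):
    print(budget, block / budget)         # overhead relative to the bit budget: 0.5, then ~0.33
print(single_norm_overhead_bits(128))     # 0.125
print(single_norm_overhead_bits(512))     # 0.03125
```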
𝗧𝗵𝗲 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗲 𝗖𝗵𝗮𝗶𝗻
Hadamard → CLT → Gaussian → Uniform angles → Hardcoded bins → Zero overhead
Again: the random rotation enables CLT, CLT gives Gaussianity, Gaussianity gives angular uniformity, uniformity eliminates data-dependent parameters.
QJL then spends 1 bit correcting residual inner-product bias. Result: quality-neutral at 3.5 bits (4.6x over FP16), 2.5 bits (6.4x) with marginal degradation, up to 8x attention speedup on H100.
Many posts quote "6x with zero accuracy loss." The paper is more precise: zero loss is at 3.5 bits (4.6x). The 6x comes at 2.5 bits with measurable quality cost. Both impressive, but conflating them is the kind of imprecision that makes LinkedIn less useful.
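The two compression ratios follow directly from the bit widths, assuming a 16-bit (FP16) baseline per element. The helper name is hypothetical.

```python
def compression_ratio(bits_per_element, baseline_bits=16):
    """Memory reduction relative to a baseline precision."""
    return baseline_bits / bits_per_element

print(f"{compression_ratio(3.5):.2f}x")   # 4.57x, the quality-neutral setting
print(f"{compression_ratio(2.5):.2f}x")   # 6.40x, the setting with some degradation
```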
https://t.co/hhcFZnxZIx
There is the understanding that comes from watching a YouTube video of a clean derivation and nodding along, and then there is the real kind, which only comes from picking up a pen, making mistakes, and arriving at the answer yourself. That is the kind that stays with you and becomes part of you.
Below is my own pen-and-paper derivation of Nesterov momentum (with λ as the momentum coefficient). The key is moving into θ̂-space, the lookahead coordinates, and eliminating the raw θ entirely. Once you do that, the final update rule falls out cleanly.
Nesterov's simple but ingenious idea upgraded the convergence of gradient descent from O(1/t) to the optimal O(1/t²) for smooth convex objectives.
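The change of variables can be checked numerically. This is a sketch of the standard reformulation, not a transcription of the handwritten derivation, and it assumes the common convention v ← λv − η∇f(θ + λv), θ ← θ + v, with η a hypothetical name for the learning rate. Defining θ̂ = θ + λv and substituting gives an update with no raw θ in it: v ← λv − η∇f(θ̂), then θ̂ ← θ̂ − η∇f(θ̂) + λv (using the new v).

```python
import numpy as np

def grad(theta, A, b):
    """Gradient of the convex quadratic f(theta) = 0.5 * theta^T A theta - b^T theta."""
    return A @ theta - b

def nesterov_raw(theta0, A, b, eta, lam, steps):
    """Classic form: evaluate the gradient at the lookahead point theta + lam * v."""
    theta, v = theta0.copy(), np.zeros_like(theta0)
    for _ in range(steps):
        v = lam * v - eta * grad(theta + lam * v, A, b)
        theta = theta + v
    return theta, v

def nesterov_lookahead(theta0, A, b, eta, lam, steps):
    """Same iteration written only in lookahead coordinates theta_hat = theta + lam * v."""
    theta_hat, v = theta0.copy(), np.zeros_like(theta0)  # v = 0, so theta_hat starts at theta0
    for _ in range(steps):
        g = grad(theta_hat, A, b)
        v = lam * v - eta * g
        theta_hat = theta_hat - eta * g + lam * v
    return theta_hat, v

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)            # symmetric positive definite test problem
b = rng.standard_normal(5)
theta0 = rng.standard_normal(5)
eta, lam, steps = 1.0 / np.linalg.eigvalsh(A).max(), 0.9, 50

theta, v = nesterov_raw(theta0, A, b, eta, lam, steps)
theta_hat, v_hat = nesterov_lookahead(theta0, A, b, eta, lam, steps)
print(np.allclose(theta + lam * v, theta_hat))   # the two forms trace the same trajectory
```

The lookahead form is convenient in practice because the gradient is evaluated at the variable you actually store.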
AI terminology can be misleading, and sometimes it even causes bugs. The term Convolutional Neural Network implies a network that computes a convolution when, in fact, the mathematical operation it performs is CROSS-CORRELATION. For some reason (ignorance or otherwise), whoever coined the term ignored the fact that discrete convolution is defined as the sum of one series times the other series reversed. In the CNN math there is no reversal of any series, which makes it cross-correlation.
So if you apply true convolution to weights that were trained with cross-correlation (for example, when porting a model to a library whose primitive flips the kernel), your network WILL NOT WORK.
CNN is actually Cross-Correlation-Neural-Network, rather than convolutional.
Details matter! Especially in math!
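A tiny NumPy example makes the difference visible. The input and the asymmetric kernel are arbitrary illustrations, and `cross_correlate_valid` is a hypothetical helper name; `np.convolve` is NumPy's true convolution, which reverses the kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])   # an asymmetric kernel

def cross_correlate_valid(x, w):
    """What a CNN layer computes: slide w over x with no reversal."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

xcorr = cross_correlate_valid(x, w)
conv = np.convolve(x, w, mode="valid")   # true convolution: reverses w before sliding

print(xcorr)   # every entry is -2
print(conv)    # every entry is +2: the sign flips for this kernel
print(np.allclose(conv, cross_correlate_valid(x, w[::-1])))   # convolution = cross-correlation with flipped kernel
```

For a symmetric kernel the two operations coincide, which is one reason the mix-up can go unnoticed.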
This is the kind of inflammatory poison that divides our nation and inspires assassins. It’s particularly ironic since Biden/Harris have just pushed through DoD Directive 5240.01 giving the Pentagon power — for the first time in history — to use lethal force to kill Americans on U.S. soil who protest government policies. If you want to understand a politician, the words from her mouth have little relevance. Look at her feet.
This is where the first and last time this ugly coward sewage rat, Yahya Sinwar met with the IDF, when he came out of the burrows of Hamas underground dwellers.
I am an Israeli professor at @Columbia.
This student - a leader in the pro-Hamas movement on campus - has called for me and my fellow Israelis to die.
They should be expelled.
The students standing by them should also be expelled.
Not suspended. Expelled.
@DouglasKMurray rejoicing for us! lol.
There are not many people who can see truth and also have the courage to speak it. He is one of them. Thank you Douglas!
@MosabHasanYOSEF - At the end they are all cry baby cowards. Nobody dares take the lead of this so called Hizb of Allah due to an obvious reason. The successor’s life span is shorter than that of a lab rat.
@BellaWallerstei We didn’t neutralize these sub-human scum by ignoring the UN. We neutralized them by making sure they weren’t around anymore. Ignoring the meaningless UN is just the default mode of operation. Lol
Terror leaders run for cover, but there is no place to hide...
Those who have planned and executed Oct 7 genocide shall DIE. And those who have celebrated their free ride over the shoulders of democracy, spreading their hateful ideologies, shall kneel before the MASTER of Victory and Defeat.
@GadSaad - I think she is becoming more religious. To those who are not ready, it can be dangerous as it replaces logic with faith. Without the balance of critical thinking, unchecked faith can lead to the acceptance of questionable beliefs and propaganda.
Israel just carried out one of the most heroic hostage rescues in history. Leave it to the moral trash of the Left to weep for Hamas and its civilian collaborators, and to blame Israel for civilian deaths.