Guy Regev @_guyregev - Twitter Profile

Why I love me a little Gaussian. A Gaussian has a kind of miraculous quality to it: it breathes in and out as you adjust σ, and it shifts its center of gravity with μ. It responds to parameters the way a living system responds to stimuli. Like life itself, there's also something self-organizing about it. It emerges naturally from the Central Limit Theorem as the shape that sums of many independent, finite-variance random variables converge toward, no matter what their underlying distribution looked like. It's less a formula you impose and more a form that nature arrives at on its own. Plus, it so often makes our lives as engineers easy. One of the most recent examples is Google's latest paper, “TurboQuant”, on lossless quantization of the KV caches of LLMs without using or keeping in memory any zero-points or scaling factors. I shed more light on how they achieved that in the post linked below. https://t.co/6cYi1bgEqb

Guy Regev

@_guyregev

3 months ago


𝗧𝘂𝗿𝗯𝗼𝗤𝘂𝗮𝗻𝘁: 𝗪𝗵𝗮𝘁 𝗘𝘃𝗲𝗿𝘆𝗼𝗻𝗲 𝗜𝘀 𝗠𝗶𝘀𝘀𝗶𝗻𝗴
I've read a dozen posts about Google's TurboQuant this week. They all say "random rotation makes the distribution predictable, so you can drop the quantization overhead." That's the what. Nobody explained the how and why, which is by far the beautiful part.
I wrote a companion post about why the Gaussian is so central to this (linked below). Engineering deep-dive:

𝗧𝗵𝗲 𝗛𝗮𝗱𝗮𝗺𝗮𝗿𝗱 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺: 𝗗𝗼𝗶𝗻𝗴 𝗧𝘄𝗼 𝗧𝗵𝗶𝗻𝗴𝘀 𝗮𝘁 𝗢𝗻𝗰𝗲
The "random rotation" is a Randomized Hadamard Transform: y = H·D·x, where H is the Hadamard matrix (normalized by 1/√d so that it is orthogonal) and D is a diagonal matrix of random ±1 sign flips. Two things happen simultaneously:
① 𝗜𝗻𝗱𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝗰𝗲. H is orthogonal, D randomizes phase. Output coordinates become approximately i.i.d. Nobody is highlighting this.
② 𝗦𝗽𝗲𝗲𝗱. Full random rotation costs O(d²). Hadamard has butterfly structure (like FFT), so H·D·x costs O(d log d). Practical on every token at inference time.
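To make the butterfly structure concrete, here is a minimal pure-Python sketch of a randomized Hadamard transform. It assumes d is a power of two and the Sylvester ordering of H; the helper names `fwht` and `randomized_hadamard` are illustrative and not taken from the paper.

```python
import math
import random

def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    d = len(out)
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < d:  # log2(d) butterfly stages, each touching all d entries
        for start in range(0, d, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

def randomized_hadamard(x, signs):
    """Compute y = (1/sqrt(d)) * H * D * x, where D = diag(signs)."""
    d = len(x)
    flipped = [s * v for s, v in zip(signs, x)]  # apply D
    scale = 1.0 / math.sqrt(d)                   # makes H orthogonal
    return [scale * v for v in fwht(flipped)]

rng = random.Random(0)
d = 8
x = [float(i + 1) for i in range(d)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
y = randomized_hadamard(x, signs)

# An orthogonal transform preserves the Euclidean norm.
norm_x = math.sqrt(sum(v * v for v in x))
norm_y = math.sqrt(sum(v * v for v in y))
```

The loop runs log₂d stages over d entries, which is where the O(d log d) cost comes from; the matrix H is never materialized.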
𝗧𝗵𝗲 𝗖𝗲𝗻𝘁𝗿𝗮𝗹 𝗟𝗶𝗺𝗶𝘁 𝗧𝗵𝗲𝗼𝗿𝗲𝗺: 𝗧𝗵𝗲 𝗘𝗻𝗴𝗶𝗻𝗲 𝗡𝗼𝗯𝗼𝗱𝘆 𝗠𝗲𝗻𝘁𝗶𝗼𝗻𝗲𝗱
Each output coordinate yᵢ = (1/√d) Σⱼ Hᵢⱼ·Dⱼ·xⱼ is a sum of d independent mean-zero random variables (x is fixed, randomness comes from D).

By CLT: yᵢ ≈ 𝒩(0, ‖x‖²/d)

Not a heuristic. An asymptotic guarantee via the Lindeberg condition, which holds whenever energy is spread across coordinates (no single xⱼ dominates ‖x‖), and the approximation tightens as d grows.
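The variance claim can be checked empirically. The sketch below fixes x, redraws the signs D many times, and compares the sample variance of one coordinate with ‖x‖²/d. It assumes the Sylvester form of H, whose entries are (-1)^popcount(i & j); the vector, the coordinate index and the sample count are arbitrary choices for illustration.

```python
import math
import random

def hadamard_entry(i, j):
    """Entry H[i][j] of the Sylvester Hadamard matrix: (-1)^popcount(i & j)."""
    return -1.0 if bin(i & j).count("1") % 2 else 1.0

def sample_coordinate(x, i, rng):
    """Draw fresh random signs D and return y_i = (1/sqrt(d)) * sum_j H_ij * D_j * x_j."""
    d = len(x)
    total = 0.0
    for j, xj in enumerate(x):
        sign = rng.choice((-1.0, 1.0))
        total += hadamard_entry(i, j) * sign * xj
    return total / math.sqrt(d)

rng = random.Random(1)
d = 64
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]   # fixed input, energy spread out
predicted_var = sum(v * v for v in x) / d         # ||x||^2 / d

samples = [sample_coordinate(x, 5, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
empirical_var = sum((s - mean) ** 2 for s in samples) / len(samples)

# For a Gaussian, about 68.3% of the mass lies within one standard deviation.
sigma = math.sqrt(predicted_var)
within_one_sigma = sum(abs(s) <= sigma for s in samples) / len(samples)
```

The variance match is exact in expectation for any x; it is the Gaussian shape (here probed by the one-sigma fraction) that relies on the CLT.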

𝗚𝗮𝘂𝘀𝘀𝗶𝗮𝗻 → 𝗨𝗻𝗶𝗳𝗼𝗿𝗺 𝗔𝗻𝗴𝗹𝗲𝘀 → 𝗭𝗲𝗿𝗼 𝗢𝘃𝗲𝗿𝗵𝗲𝗮𝗱
PolarQuant pairs coordinates and converts to polar: (y₁,y₂) → (r, θ). For two i.i.d. zero-mean Gaussians, θ is distributed Uniform[0, 2π). Classical result from rotational symmetry.
Uniform = dream scenario for quantization. Equal-width bins are optimal. Boundaries are fixed: binᵢ = i·2π/2ᵏ. No per-block scale factors. No zero-points. Hardcoded for every vector, every layer, every token.
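As a rough illustration of the fixed bin boundaries i·2π/2ᵏ, the following sketch quantizes the angle of a single coordinate pair and reconstructs it from the bin center. It covers one pair only, not the full recursive scheme, and the function names and the bin-center reconstruction are assumptions made for the example.

```python
import math

TWO_PI = 2.0 * math.pi

def quantize_angle(y1, y2, k):
    """Map (y1, y2) to (radius, bin index) using 2**k equal-width angle bins."""
    radius = math.hypot(y1, y2)
    theta = math.atan2(y2, y1) % TWO_PI      # fold (-pi, pi] into [0, 2*pi)
    bins = 1 << k
    index = int(theta / TWO_PI * bins) % bins  # bin i covers [i*2pi/2^k, (i+1)*2pi/2^k)
    return radius, index

def dequantize_angle(radius, index, k):
    """Reconstruct the pair from the radius and the center of the chosen bin."""
    bins = 1 << k
    theta = (index + 0.5) * TWO_PI / bins
    return radius * math.cos(theta), radius * math.sin(theta)

k = 4
r, idx = quantize_angle(0.3, -0.7, k)
approx = dequantize_angle(r, idx, k)
```

Nothing in either function depends on the data: the boundaries are determined by k alone, which is why no scale factor or zero-point has to be stored alongside the indices.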
The radii recurse up a binary tree (log₂d levels) to one final radius = ‖x‖, stored once in FP16. Compare: traditional methods store 2 FP16 constants per block. At block size 32, that's 1.0 bit/element of overhead. When your budget is 2-3 bits, that's 33-50% wasted on bookkeeping. PolarQuant's overhead at d=128: 0.125 bits/element. At d=512: 0.03. It vanishes as d grows.
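The overhead figures above are simple arithmetic: stored constant bits divided by the number of elements they cover. A small sketch, with a helper name chosen only for this example:

```python
def overhead_bits_per_element(constants, bits_per_constant, elements):
    """Bits of stored side information amortized over the elements they cover."""
    return constants * bits_per_constant / elements

# Traditional block quantization: scale + zero-point in FP16 per 32-element block.
block_overhead = overhead_bits_per_element(2, 16, 32)

# One FP16 radius per vector of dimension d.
polar_128 = overhead_bits_per_element(1, 16, 128)
polar_512 = overhead_bits_per_element(1, 16, 512)

# Overhead relative to a 2-bit or 3-bit payload budget.
share_at_3_bits = block_overhead / 3
share_at_2_bits = block_overhead / 2
```

This gives 1.0 bit/element for the block scheme, 0.125 at d=128 and 0.03125 (about 0.03) at d=512.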

𝗧𝗵𝗲 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗲 𝗖𝗵𝗮𝗶𝗻
Hadamard → CLT → Gaussian → Uniform angles → Hardcoded bins → Zero overhead

Again: the random rotation enables CLT, CLT gives Gaussianity, Gaussianity gives angular uniformity, uniformity eliminates data-dependent parameters.

QJL then spends 1 bit correcting residual inner-product bias. Result: quality-neutral at 3.5 bits (4.6x over FP16), 2.5 bits (6.4x) with marginal degradation, up to 8x attention speedup on H100.
Many posts quote "6x with zero accuracy loss." The paper is more precise: zero loss is at 3.5 bits (4.6x). The 6x comes at 2.5 bits with measurable quality cost. Both impressive, but conflating them is the kind of imprecision that makes LinkedIn less useful.
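The two compression ratios follow directly from dividing the FP16 width by the bit budget, which is a quick way to see why they should not be conflated. A short sketch with an illustrative helper name:

```python
FP16_BITS = 16

def compression_ratio(bits_per_element, baseline_bits=FP16_BITS):
    """Size reduction relative to storing each element in FP16."""
    return baseline_bits / bits_per_element

ratio_quality_neutral = compression_ratio(3.5)   # the zero-loss operating point
ratio_marginal_loss = compression_ratio(2.5)     # the "6x" operating point
print(f"{ratio_quality_neutral:.1f}x, {ratio_marginal_loss:.1f}x")  # prints "4.6x, 6.4x"
```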

https://t.co/hhcFZnxZIx



Guy Regev

@_guyregev

4 months ago


There is understanding that comes from watching a YouTube video of a clean derivation and nodding along, and then there is the real kind, which only comes from picking up a pen, making mistakes and arriving at the answer yourself. That is the kind that stays with you and becomes part of you.

Below is my own pen-and-paper derivation of Nesterov Momentum (with λ as the momentum coefficient). The key is moving into θ̂-space, the lookahead coordinates, and eliminating the raw θ entirely. Once you do that, the final update rule falls out cleanly.

This simple but ingenious mathematical trick upgraded the convergence rate of gradient descent from O(1/t) to the optimal O(1/t²) for smooth convex objectives.
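The handwritten derivation itself is not reproduced in the text, so the following is only a sketch of one standard way to write the change of variables, and it may differ in detail from the pen-and-paper version. It assumes the raw update v ← λv − η∇f(θ + λv), θ ← θ + v, defines θ̂ = θ + λv, and checks numerically that the θ̂-only rule θ̂ ← θ̂ + λv − η∇f(θ̂) produces the same lookahead points. The learning rate η (`lr`), the test objective and the function names are assumptions.

```python
def nesterov_raw(grad, theta0, lr, lam, steps):
    """Raw form: tracks theta and v, evaluates the gradient at theta + lam * v."""
    theta, v = theta0, 0.0
    lookaheads = []
    for _ in range(steps):
        lookahead = theta + lam * v
        lookaheads.append(lookahead)
        v = lam * v - lr * grad(lookahead)
        theta = theta + v
    return lookaheads

def nesterov_lookahead(grad, theta_hat0, lr, lam, steps):
    """Lookahead form: theta is eliminated; only theta_hat and v are tracked."""
    theta_hat, v = theta_hat0, 0.0
    points = []
    for _ in range(steps):
        points.append(theta_hat)
        g = grad(theta_hat)
        v = lam * v - lr * g                     # same velocity update
        theta_hat = theta_hat + lam * v - lr * g  # update written purely in theta_hat
    return points

def grad(t):
    """Gradient of the illustrative objective f(t) = (t - 3)^2."""
    return 2.0 * (t - 3.0)

raw_points = nesterov_raw(grad, 0.0, lr=0.1, lam=0.9, steps=50)
hat_points = nesterov_lookahead(grad, 0.0, lr=0.1, lam=0.9, steps=50)
max_gap = max(abs(a - b) for a, b in zip(raw_points, hat_points))
```

With v starting at zero, θ̂₀ = θ₀, so both loops start from the same point and `max_gap` stays at floating-point noise level.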


Guy Regev

@_guyregev

7 months ago

AI terminology can be misleading, and can sometimes even cause bugs. The term Convolutional Neural Network implies a neural network that actually computes convolution when, in fact, the mathematical operation it performs is CROSS-CORRELATION. For some reason (ignorance or otherwise), whoever coined the term convolution for the math performed in convolutional neural networks ignored the fact that discrete convolution is defined as the sum of one series times the other series reversed, while in the CNN math no series is reversed, which makes it cross-correlation. So if you actually calculated convolution with kernels that a CNN learned under cross-correlation (for example, when porting trained weights to a library that flips the kernel), your network WILL NOT WORK. CNN is actually Cross-Correlation Neural Network, rather than convolutional. Details matter! Especially in math!
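A small 1-D sketch of the difference, in plain Python with "valid" (no padding) outputs. The function names are illustrative; deep-learning libraries implement the first operation under the name convolution.

```python
def cross_correlate(signal, kernel):
    """Slide the kernel over the signal without reversing it (what CNN layers compute)."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]

def convolve(signal, kernel):
    """True discrete convolution: cross-correlation with the kernel reversed."""
    return cross_correlate(signal, kernel[::-1])

signal = [1, 2, 3, 4]
asymmetric = [1, 0, -1]
symmetric = [1, 2, 1]

corr_out = cross_correlate(signal, asymmetric)   # [-2, -2]
conv_out = convolve(signal, asymmetric)          # [2, 2]
same_for_symmetric = cross_correlate(signal, symmetric) == convolve(signal, symmetric)
```

For a symmetric kernel the two coincide, which is one reason the mix-up can go unnoticed; for an asymmetric kernel, applying the wrong one changes the output.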


Guy Regev

@_guyregev

7 months ago

@karanjakhar88 a bug overleaf dot com


Guy Regev

@_guyregev

over 1 year ago

I actually think the words are meaningful too. Carl Jung called it “Projection”.

Robert F. Kennedy Jr

@RobertKennedyJr

over 1 year ago

This is the kind of inflammatory poison that divides our nation and inspires assassins. It’s particularly ironic since Biden/Harris have just pushed through DoD Directive 5240.01 giving the Pentagon power — for the first time in history — to use lethal force to kill Americans on U.S. soil who protest government policies. If you want to understand a politician, the words from her mouth have little relevance. Look at her feet.


Guy Regev

@_guyregev

over 1 year ago

@DouglasKMurray Sitting where the Nazi filth sewage rat met with the IDF for the first and last time during this war.


Guy Regev

@_guyregev

over 1 year ago

This is where the first and last time this ugly coward sewage rat, Yahya Sinwar met with the IDF, when he came out of the burrows of Hamas underground dwellers.

Douglas Murray

@DouglasKMurray

over 1 year ago

Exclusive: Inside the utter devastation where Hamas despot Yahya Sinwar met his demise in Gaza. ⁦@nypost⁩ https://t.co/jkjEbl7mjg


Guy Regev

@_guyregev

over 1 year ago

@MosabHasanYOSEF Sewer rats have stinky mouths.


Guy Regev

@_guyregev

over 1 year ago

@ShaiDavidai - Expulsion goes without saying. Calling for the murder of others should be handled with a criminal trial.

Shai Davidai

@ShaiDavidai

over 1 year ago

I am an Israeli professor at @Columbia. This student - a leader in the pro-Hamas movement on campus - has called for me and my fellow Israelis to die. They should be expelled. The students standing by them should also be expelled. Not suspended. Expelled.


Guy Regev

@_guyregev

over 1 year ago

@DouglasKMurray rejoicing for us! lol. There are not many people who can see truth and also have the courage to speak it. He is one of them. Thank you Douglas!


Guy Regev

@_guyregev

over 1 year ago

@MosabHasanYOSEF @MosabHasanYOSEF - At the end they are all cry baby cowards. Nobody dares take the lead of this so called Hizb of Allah due to an obvious reason. The successor’s life span is shorter than that of a lab rat.


Guy Regev

@_guyregev

over 1 year ago

@BellaWallerstei We didn’t neutralize these sub-human scum by ignoring the UN. We neutralized them by making sure they weren’t around anymore. Ignoring the meaningless UN is just the default mode of operation. Lol


_guyregev retweeted

Mosab Hassan Yousef

@MosabHasanYOSEF

over 1 year ago

Terror leaders run for cover, but there is no place to hide... Those who have planned and executed Oct 7 genocide shall DIE. And those who have celebrated their free ride over the shoulders of democracy, spreading their hateful ideologies, shall kneel before the MASTER of Victory and Defeat.


Guy Regev

@_guyregev

almost 2 years ago

@GadSaad @GadSaad - I think she is becoming more religious. To those who are not ready, it can be dangerous as it replaces logic with faith. Without the balance of critical thinking, unchecked faith can lead to the acceptance of questionable beliefs and propaganda.


_guyregev retweeted

Ben Shapiro

@benshapiro

about 2 years ago

Israel just carried out one of the most heroic hostage rescues in history. Leave it to the moral trash of the Left to weep for Hamas and its civilian collaborators, and to blame Israel for civilian deaths.


Guy Regev

@_guyregev

about 2 years ago

@MajmudarAdam Good work. But this is not a GPU. This is a multi-core vector accelerator. Nice work though!
