Jason H. @jhofmann - Twitter Profile

Every time you message an AI chatbot, the model stores your entire conversation in temporary memory called a KV cache (a cheat sheet so it doesn’t re-read everything from scratch). On a large model like Llama 70B running a long conversation, that cache alone eats 40GB of GPU space, often more than the AI model itself. That’s half a $30,000 GPU chip consumed by one user’s memory.

Google just published TurboQuant, a compression algorithm that shrinks this cache by 6x, down to just 3 bits per value, with zero accuracy loss across every benchmark tested. No retraining. No fine-tuning. Drop-in replacement.

AI inference (running models for actual users, not training them) now makes up 55% of all AI compute spending. Hyperscalers are pouring nearly $700 billion into AI infrastructure in 2026. The KV cache is the single biggest memory bottleneck in that stack.

When GPU cache memory fills up, the system can’t take more users. 6x compression means the same hardware handles roughly 6x more simultaneous conversations, or 6x longer context windows, or some mix of both. At cloud rates of $2-3/hour per H100 GPU, that’s the difference between profitable and unprofitable AI deployment.

TurboQuant randomly rotates data to simplify its structure, applies a compressor, then adds a 1-bit error correction step to catch errors before they compound. On H100 GPUs it delivers up to 8x speedup over uncompressed computation.

Google tested it across five long-context benchmarks on Llama, Gemma, and Mistral models. Perfect scores on needle-in-a-haystack (finding one specific fact buried in massive text). Being presented at ICLR 2026.

It also outperforms existing methods for vector search, the technology that powers how search engines find similar results across billions of entries. Google runs billions of these searches daily.

Three bits. Zero loss. 6x compression on the biggest memory bottleneck in a $700 billion infrastructure buildout.
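For anyone who wants to sanity-check the 40GB figure, here is a rough back-of-the-envelope sketch. It is not from the TurboQuant paper: the function name `kv_cache_bytes` is made up for illustration, and the model shape (80 layers, 8 key/value heads, head dimension 128, 16-bit values, a 128,000-token conversation) is an assumption based on the commonly published Llama 70B configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value=16):
    """Approximate bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers one key vector and one value vector per head, per layer.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * seq_len * bits_per_value / 8


# Assumed Llama-70B-style shape (illustrative, not taken from the post).
full = kv_cache_bytes(80, 8, 128, seq_len=128_000)
small = kv_cache_bytes(80, 8, 128, seq_len=128_000, bits_per_value=3)

print(f"16-bit cache: {full / 1e9:.1f} GB")
print(f"3-bit cache:  {small / 1e9:.1f} GB")
print(f"bit-width ratio: {full / small:.2f}x")
```

Under these assumptions the 16-bit cache for a single conversation comes out close to the 40GB quoted above. Note that this sketch only scales by bit width (16 bits to 3 bits), so its ratio will not exactly match the 6x figure, which is the post's number and may be measured against a different baseline.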

Jason H.

@jhofmann

5 months ago

@nanobyte84 @supernalmystic @washghost1 It’s actually the other way around, dust and pet hair won’t get sucked up from across the room because they are too heavy - think of how close you need to get a vacuum head to the floor before the dust will budge.

Jason H.

@jhofmann

5 months ago

@nanobyte84 @supernalmystic @washghost1 It captures plenty of tiny stuff. @grok are CR boxes effective at removing small particles?

Jason H.

@jhofmann

5 months ago

@DutchRojas @mcuban I assumed this was you posting at first 😆

Jason H.

@jhofmann

7 months ago

@iyoushetwt First editor: vi
First graphical editor: IntelliJ IDEA

Jason H.

@jhofmann

8 months ago

@NoHealthNoFun @lgoshen @motiongirlie Germany is practically the homeopathy capital of the world - it's shameful.

Jason H.

@jhofmann

8 months ago

@LauraMiers What do you mean it was solved?

Jason H.

@jhofmann

8 months ago

@LauraMiers I don’t know anything about this machine, I’m just good at searching - but this looks pretty low tech. Don’t think there are major innovations in hardware or software, this is just a way to show investors a “recurring subscription-based revenue stream” to get a better valuation.

Jason H.

@jhofmann

8 months ago

@LauraMiers I know, crazy. I first learned about that business model at our dermatologist’s office: they had to insert a “treatment card” into a machine to use it, and I asked about it. Found this video: https://t.co/hUEbs1LkxP Perhaps you can try calling your mother’s doctor about it?
