The UK is threatening tech bosses with prison if they don't install surveillance software on every phone in Britain.
The minister who resigned for not doing this fast enough called it "incremental change."
GOOGLE JUST COMPRESSED 31GB OF AI MEMORY INTO 4GB.
That’s nearly an 8x reduction.
Memory has quietly become one of the biggest bottlenecks for running large AI models at scale. Every extra gigabyte limits how many users a system can serve and how much context it can handle.
By dramatically compressing memory requirements without meaningfully hurting performance, Google is making it possible to run larger workloads on the same hardware.
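A rough back-of-the-envelope sketch of what that reduction could mean for serving capacity. The 31GB and 4GB figures come from the post; the 80GB memory budget and the idea that each figure is a per-session footprint are assumptions made purely for illustration, not details from Google.

```python
def compression_ratio(before_gb: float, after_gb: float) -> float:
    """How many times smaller the footprint is after compression."""
    return before_gb / after_gb


def sessions_that_fit(budget_gb: float, per_session_gb: float) -> int:
    """Whole sessions that fit in a fixed memory budget."""
    return int(budget_gb // per_session_gb)


# Figures quoted in the post.
BEFORE_GB = 31
AFTER_GB = 4

# Hypothetical accelerator memory budget (an assumption, not from the post).
BUDGET_GB = 80

ratio = compression_ratio(BEFORE_GB, AFTER_GB)
sessions_before = sessions_that_fit(BUDGET_GB, BEFORE_GB)
sessions_after = sessions_that_fit(BUDGET_GB, AFTER_GB)

print(f"compression ratio: {ratio:.2f}x")
print(f"sessions in {BUDGET_GB}GB: {sessions_before} before, {sessions_after} after")
```

Under those assumptions the ratio is 7.75x, which is where "nearly 8x" comes from, and the same budget holds roughly ten times as many sessions because less memory is stranded by rounding.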
@marias_martin @KaiXCreator My bad, he says "forget the code even exists", implying that he never looks at the code, and he also mentions he doesn't read the diffs. Source: https://t.co/7j5TF9mlaU
There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
You don’t “run a model”
You run Kernels
The model is just a graph
The Inference Engine is the scheduler / optimizer / executor
But the actual work? That happens in the Kernels
- MatMul Kernels
- Attention Kernels
- RMSNorm Kernels
- KV cache Kernels
- Quantized linear Kernels
- Sampling Kernels
- Fused “please don’t write this back to memory 9 times” Kernels
Same model, same GPU, same VRAM
Wildly different performance
Because one stack is using optimized fused Kernels that understand your hardware
And the other stack is playing hot potato with tensors through 47 tiny launches and pretending the GPU is the problem
Bad Kernels make people say:
“this model is slow”
Good Kernels make people say:
“wait how is this running locally?”
This is why Inference Engines and the Kernels implemented within them matter
The model is the recipe
The hardware is the kitchen
The Kernels are the knives, pans, burners, and the chef not cutting onions with a spoon
Most people benchmark models
The real ones benchmark the Kernels underneath
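To make the fused-versus-unfused point concrete, here is a toy sketch in plain Python. It is not real GPU code: `MemoryCounter`, `rmsnorm_unfused`, and `rmsnorm_fused` are hypothetical names, and the "launches" and memory traffic are just counters that imitate what separate kernels would do. It computes the same RMSNorm two ways and tallies how often each version touches memory.

```python
import math


class MemoryCounter:
    """Toy stand-in for GPU global memory: counts launches and element traffic."""

    def __init__(self):
        self.launches = 0
        self.reads = 0
        self.writes = 0


def rmsnorm_unfused(x, weight, mem, eps=1e-6):
    """Four separate 'kernels', each writing its result back to memory."""
    n = len(x)

    mem.launches += 1            # kernel 1: square every element
    mem.reads += n
    squares = [v * v for v in x]
    mem.writes += n

    mem.launches += 1            # kernel 2: reduce to the mean of squares
    mem.reads += n
    mean_sq = sum(squares) / n
    mem.writes += 1

    mem.launches += 1            # kernel 3: normalize by the RMS
    mem.reads += n + 1
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms for v in x]
    mem.writes += n

    mem.launches += 1            # kernel 4: apply the learned weight
    mem.reads += 2 * n
    out = [v * w for v, w in zip(normed, weight)]
    mem.writes += n
    return out


def rmsnorm_fused(x, weight, mem, eps=1e-6):
    """One 'kernel': intermediates stay local, only the output is written."""
    n = len(x)
    mem.launches += 1
    mem.reads += 2 * n           # x and weight are each read once
    mean_sq = sum(v * v for v in x) / n
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    out = [v * inv_rms * w for v, w in zip(x, weight)]
    mem.writes += n
    return out


x = [1.0, 2.0, 3.0, 4.0]
weight = [1.0, 1.0, 1.0, 1.0]
unfused_mem, fused_mem = MemoryCounter(), MemoryCounter()
unfused_out = rmsnorm_unfused(x, weight, unfused_mem)
fused_out = rmsnorm_fused(x, weight, fused_mem)
print("unfused:", vars(unfused_mem))
print("fused:  ", vars(fused_mem))
```

Both versions return the same numbers, but the fused one uses a single launch and writes only the final output, which is the gap the post is describing. On real hardware the actual savings depend on the kernel, the tensor shapes, and the GPU.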
I think the challenge is that everyone can now build apps
But
1) almost nobody has distribution (like an audience), or
2) the money to pay for distribution (ads or UGC), or
3) the creative genius to get distribution for free (classically called guerrilla marketing)