Single vs Multi-head Attention by hand ✍️ Resize matrices yourself 👉 https://t.co/2yRBcyz4xy
The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural — same total Wqkv weights, partitioned into smaller q–k–v triples.
Look at the two diagrams below. Both Wqkv matrices have the same height — same number of weight rows, same number of parameters. What changes is how that single tall block is sliced.
• Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors.
• Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples — each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left.
The compute trade-off — kind of. Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H.
• Single-head: seq × seq = 40² = 1600 dot products
• Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×)
But each multi-head dot product is narrower — its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications, the totals are equal:
• Single-head: seq² × (H × head_dim) = 40² × 36 = 57600
• Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600
Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost — it's the same total compute, sliced into H finer-grained heads.
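The counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic for the scoring step S = Kᵀ × Q, using the numbers from the post (40 tokens, 3 heads, 12 rows per head); the function and variable names are made up for illustration.

```python
# Dimensions taken from the post: 40 tokens, 3 heads, 12 rows per head.
seq_len = 40
num_heads = 3
head_dim = 12
model_dim = num_heads * head_dim  # 36, the width of the single-head Q/K/V


def scoring_cost(seq_len, num_heads, head_dim):
    """Return (dot products, scalar multiplications) for the scoring step."""
    dot_products = seq_len * seq_len * num_heads
    multiplications = dot_products * head_dim  # each dot product has head_dim terms
    return dot_products, multiplications


single = scoring_cost(seq_len, num_heads=1, head_dim=model_dim)
multi = scoring_cost(seq_len, num_heads=num_heads, head_dim=head_dim)

# Height of Wqkv: one q, k, v block per head, each head_dim rows tall.
wqkv_rows_single = 3 * 1 * model_dim
wqkv_rows_multi = 3 * num_heads * head_dim

print("single-head:", single)  # (1600, 57600)
print("multi-head: ", multi)   # (4800, 57600)
print("Wqkv rows:  ", wqkv_rows_single, wqkv_rows_multi)  # 108 108
```

Changing `num_heads` while keeping `num_heads * head_dim` fixed changes the first number in each tuple but not the second.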
April was a pretty strong month for LLM releases:
- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4
All are now added to the LLM Architecture Gallery.
More details once I am fully back in May!
CPU vs GPU vs TPU vs NPU vs LPU, explained visually:
5 hardware architectures power AI today.
Each one makes a fundamentally different tradeoff between flexibility, parallelism, and memory access.
> CPU
It is built for general-purpose computing. A few powerful cores handle complex logic, branching, and system-level tasks.
It has deep cache hierarchies and off-chip main memory (DRAM). It's great for operating systems, databases, and decision-heavy code, but not that great for repetitive math like matrix multiplications.
> GPU
Instead of a few powerful cores, GPUs spread work across thousands of smaller cores that all execute the same instruction on different data.
This is why GPUs dominate AI training. The parallelism maps directly to the kind of math neural networks need.
> TPU
TPUs go one step further with specialization.
The core compute unit is a grid of multiply-accumulate (MAC) units where data flows through in a wave pattern.
Weights enter from one side, activations from the other, and partial results propagate without going back to memory each time.
The entire execution is compiler-controlled, not hardware-scheduled. Google designed TPUs specifically for neural network workloads.
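To make the wave pattern concrete, here is a toy simulation of a MAC grid in plain Python. It is a simplified sketch, not a model of any real TPU: it assumes an output-stationary layout in which each cell keeps its own running sum, weights stream in from the left edge, and activations stream in from the top edge. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(weights, activations):
    """Compute weights @ activations on a simulated n x m grid of MAC cells."""
    n, k, m = len(weights), len(weights[0]), len(activations[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per MAC cell
    w_reg = [[0] * m for _ in range(n)]  # weight value currently in each cell
    a_reg = [[0] * m for _ in range(n)]  # activation value currently in each cell

    for t in range(n + m + k - 2):
        # Shift: weights move one cell to the right, activations one cell down.
        w_reg = [[0] + row[:-1] for row in w_reg]
        a_reg = [[0] * m] + a_reg[:-1]

        # Feed the edges. Row i and column j are delayed by i and j cycles,
        # which creates the diagonal "wave" through the grid.
        for i in range(n):
            s = t - i
            w_reg[i][0] = weights[i][s] if 0 <= s < k else 0
        for j in range(m):
            s = t - j
            a_reg[0][j] = activations[s][j] if 0 <= s < k else 0

        # Every cell does one multiply-accumulate per cycle, with no memory access.
        for i in range(n):
            for j in range(m):
                acc[i][j] += w_reg[i][j] * a_reg[i][j]
    return acc


W = [[1, 2, 3], [4, 5, 6]]
A = [[1, 0, 2, 1], [0, 1, 3, 1], [2, 1, 0, 1]]
result = systolic_matmul(W, A)
print(result)  # [[7, 5, 8, 6], [16, 11, 23, 15]]
```

The inputs are read from "memory" only at the grid edges; everything else is a neighbor-to-neighbor handoff, which is the property the post describes.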
> NPU
This is an edge-optimized variant.
The architecture is built around a Neural Compute Engine packed with MAC arrays and on-chip SRAM, but instead of high-bandwidth memory (HBM), NPUs use low-power system memory.
The design goal is to run inference within single-digit watt power budgets, on devices like smartphones, wearables, and IoT hardware.
Apple Neural Engine and Intel's NPU follow this pattern.
> LPU (Language Processing Unit)
This is the newest entrant, by Groq.
The architecture removes off-chip memory from the critical path entirely. All weight storage lives in on-chip SRAM.
Execution is fully deterministic and compiler-scheduled, which means zero cache misses and zero runtime scheduling overhead.
The tradeoff is that it provides limited memory per chip, which means you need hundreds of chips linked together to serve a single large model. But the latency advantage is real.
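A back-of-the-envelope calculation shows where "hundreds of chips" comes from. The numbers below are assumptions chosen for illustration, not figures from the post: 230 MB of on-chip SRAM per chip, and a 70-billion-parameter model stored at 1 byte per weight. It also ignores activations, KV cache, and any replication.

```python
# Illustrative assumptions (not from the post).
sram_bytes_per_chip = 230_000_000   # assumed on-chip SRAM per chip
num_parameters = 70_000_000_000     # assumed model size
bytes_per_parameter = 1             # assumed 8-bit weights

model_bytes = num_parameters * bytes_per_parameter

# Ceiling division: every weight has to live in some chip's SRAM.
chips_needed = -(-model_bytes // sram_bytes_per_chip)

print(chips_needed)  # 305
```

Doubling the bytes per weight (for example 16-bit weights) doubles the model size in bytes and roughly doubles the chip count under the same assumptions.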
AI compute has evolved from general-purpose flexibility (CPU) to extreme specialization (LPU). Each step trades some level of generality for efficiency.
The visual below maps the internal architecture of all five side by side.
👉 Over to you: Which of these 5 have you actually worked with or deployed on?
Nvidia trained a billion-parameter LLM without a single gradient, without backprop, without fp32 weights anywhere.
And it reportedly trains 100x faster than earlier evolution-strategy approaches at that scale.
For the last decade, every major AI model has been trained the exact same way.
Backpropagation.
It requires massive, expensive GPUs. It requires complex floating-point math. It requires a massive memory footprint just to calculate the gradients.
It’s the reason why only mega-corporations can afford to train foundation models.
Until today.
Nvidia and Oxford published a paper called "Evolution Strategies at the Hyperscale."
They completely bypassed backpropagation.
Instead of calculating gradients, they use Evolution Strategies (ES), a method that randomly mutates the AI's parameters, sees what works best, and literally evolves the model.
In the past, this was way too computationally expensive for billion-parameter models.
But they fixed it by inventing EGGROLL (Evolution Guided General Optimisation via Low-rank Learning).
By compressing the mutations into low-rank matrices, they achieved a 100x increase in training speed for large models.
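The idea of low-rank mutations can be sketched in NumPy. This is a minimal toy under stated assumptions, not the paper's implementation: it uses a standard ES update with normalized fitness, a made-up linear regression task, floating-point weights, and hypothetical names and hyperparameters throughout. Each mutation is the product of two thin matrices `a @ b.T`, and the forward pass applies it as `(x @ b) @ a.T` so the full-size mutation matrix is never built per population member.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 4, 2
pop_size, sigma, lr, steps = 64, 0.1, 0.1, 300

# Toy task (hypothetical): recover w_true from input/output pairs.
w_true = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(128, d_in))
y = x @ w_true.T


def loss(w):
    """Mean squared error of the unperturbed weights."""
    return float(np.mean((x @ w.T - y) ** 2))


def fitness(w, a, b):
    """Negative MSE of the mutated weights w + sigma * a @ b.T.

    The mutation is applied as two thin matmuls instead of one full matrix.
    """
    pred = x @ w.T + sigma * (x @ b) @ a.T
    return -np.mean((pred - y) ** 2)


w = np.zeros((d_out, d_in))
initial_loss = loss(w)

for _ in range(steps):
    # One low-rank mutation per population member; no gradients anywhere.
    a = rng.normal(size=(pop_size, d_out, rank))
    b = rng.normal(size=(pop_size, d_in, rank)) / np.sqrt(rank)
    scores = np.array([fitness(w, a[i], b[i]) for i in range(pop_size)])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    # Move toward mutations that scored well: fitness-weighted average of a @ b.T.
    update = np.einsum("p,por,pir->oi", scores, a, b) / pop_size
    w += lr * update

final_loss = loss(w)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Storing a mutation as `a` and `b` costs `rank * (d_out + d_in)` numbers instead of `d_out * d_in`, which is where the savings come from as the matrices grow.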
But that isn't the craziest part.
Because it doesn't use backpropagation, it doesn't need high-precision math.
They successfully trained a massive language model entirely on pure integer datatypes. Raw, basic, low-level math.
This completely rewrites the economics of open-source AI.
If you can train models directly on the cheap, fast integer datatypes they use for inference, the hardware requirements collapse.
You don't need a multi-million dollar cluster of high-end GPUs just to do the math anymore.