li

Single vs Multi-head Attention by hand ✍️ Resize matrices yourself 👉 https://t.co/2yRBcyz4xy

The most important fact about multi-head attention: it has the same parameter count as single-head attention. The difference is purely structural: the same total Wqkv weights, partitioned into smaller q–k–v triples.

Look at the two diagrams below. Both Wqkv matrices have the same height: same number of weight rows, same number of parameters. What changes is how that single tall block is sliced.

• Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors.
• Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples, each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left.

The compute trade-off (kind of). Same Wqkv weights. Multi-head runs the attention scoring S = Kᵀ × Q once per head, so the dot-product count multiplies by H.

• Single-head: seq × seq = 40² = 1600 dot products
• Multi-head: seq × seq × H = 40² × 3 = 4800 dot products (3×)

But each multi-head dot product is narrower: its inner dimension is head_dim instead of H × head_dim. So when you count actual scalar multiplications in the scoring step, the totals are equal:

• Single-head: seq² × (H × head_dim) = 40² × 36 = 57600
• Multi-head: seq² × H × head_dim = 40² × 3 × 12 = 57600

Same FLOPs. Multi-head buys you H independent attention patterns at no extra weight cost and no extra arithmetic cost. It's the same total compute, sliced into H finer-grained heads.
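The counts in the tweet can be checked with a few lines of Python. This is a minimal sketch that only reproduces the arithmetic for the scoring step; the function names `qkv_rows` and `score_cost` are made up for illustration, and the values (sequence length 40, 3 heads of width 12 versus 1 head of width 36) are the ones quoted above.

```python
def qkv_rows(num_heads, head_dim):
    """Height of the stacked Wqkv block: one Q, one K and one V slab per head."""
    return 3 * num_heads * head_dim


def score_cost(seq_len, num_heads, head_dim):
    """Count work for the scoring step S = K^T Q across all heads.

    Returns (dot_products, scalar_multiplications).
    """
    dot_products = seq_len * seq_len * num_heads
    # Each dot product multiplies head_dim pairs of numbers.
    multiplications = dot_products * head_dim
    return dot_products, multiplications


single = score_cost(seq_len=40, num_heads=1, head_dim=36)
multi = score_cost(seq_len=40, num_heads=3, head_dim=12)

print(qkv_rows(1, 36), qkv_rows(3, 12))  # 108 108
print(single)                            # (1600, 57600)
print(multi)                             # (4800, 57600)
```

The dot-product count triples with 3 heads, while the Wqkv height and the scalar multiplication count stay the same, which is the point the tweet makes.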

553

453

35K

li @geraldlee193

about 2 months ago

@zmx8067 They probably didn't even know it had been launched 😂

li @geraldlee193

about 2 months ago

@autobypayment @loong_of SPD

geraldlee193 retweeted

Eason Mao☢

@KELMAND1

about 2 months ago

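A Type 055 fired a YJ-20 hypersonic missile near the Philippines in full view of the US, the Philippines, and Japan

According to an April 25, 2026 report by the Asian defense website DEFENCE SECURITY ASIA, a Type 055 destroyer launched another YJ-20 hypersonic anti-ship missile while exercising alongside the aircraft carrier Liaoning in waters near the Philippines.

This is far more than a routine naval exercise, because it places China's most advanced sea-based anti-access weapon system directly into the strategic picture of the ongoing "Balikatan 2026" ("Shoulder-to-Shoulder") joint exercise.

The strategic message is clear: any future contingency in the South China Sea will no longer be decided solely by territorial disputes around reefs and shoals, but by the credible threat of long-range hypersonic maritime strike systems capable of targeting carrier strike groups, expeditionary forces, and allied naval logistics nodes.
https://t.co/hZorhCTFnc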

301

92K

geraldlee193 retweeted

田字格

@tianzige4

about 2 months ago

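"Shoulder-to-shoulder" has turned into hugging and cuddling 😂😂😂 Chinese warships get a close-up look at the US-Philippine seven-nation Balikatan ("Shoulder-to-Shoulder") exercise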

103

599

166K

li @geraldlee193

about 2 months ago

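@UFesenbek70153 @mudiaoLOL @daneishe888 So why block it and make people climb over? Everyone wants to take this road, yet you insist on putting up a barrier and making people pay to get across. Is something wrong with you?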

li @geraldlee193

about 2 months ago

@LcKai @ChandlerHu16936 @daneishe888 Where are you hanging out these days?

geraldlee193 retweeted

Sebastian Raschka

@rasbt

about 2 months ago

April was a pretty strong month for LLM releases:
- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4

All are now added to the LLM Architecture Gallery.
More details once I am fully back in May! https://t.co/HDYbWi2pcc

438

126K

geraldlee193 retweeted

Akshay 🚀

@akshay_pachaar

about 2 months ago

CPU vs GPU vs TPU vs NPU vs LPU, explained visually:

5 hardware architectures power AI today. Each one makes a fundamentally different tradeoff between flexibility, parallelism, and memory access.

> CPU
It is built for general-purpose computing. A few powerful cores handle complex logic, branching, and system-level tasks. It has deep cache hierarchies and off-chip main memory (DRAM). It's great for operating systems, databases, and decision-heavy code, but not that great for repetitive math like matrix multiplications.

> GPU
Instead of a few powerful cores, GPUs spread work across thousands of smaller cores that all execute the same instruction on different data. This is why GPUs dominate AI training. The parallelism maps directly to the kind of math neural networks need.

> TPU
They go one step further with specialization. The core compute unit is a grid of multiply-accumulate (MAC) units where data flows through in a wave pattern. Weights enter from one side, activations from the other, and partial results propagate without going back to memory each time. The entire execution is compiler-controlled, not hardware-scheduled. Google designed TPUs specifically for neural network workloads.

> NPU
This is an edge-optimized variant. The architecture is built around a Neural Compute Engine packed with MAC arrays and on-chip SRAM, but instead of high-bandwidth memory (HBM), NPUs use low-power system memory. The design goal is to run inference within single-digit-watt power budgets on devices such as smartphones, wearables, and IoT hardware. Apple Neural Engine and Intel's NPU follow this pattern.

> LPU (Language Processing Unit)
This is the newest entrant, by Groq. The architecture removes off-chip memory from the critical path entirely. All weight storage lives in on-chip SRAM. Execution is fully deterministic and compiler-scheduled, which means zero cache misses and zero runtime scheduling overhead. The tradeoff is limited memory per chip, which means you need hundreds of chips linked together to serve a single large model. But the latency advantage is real.

AI compute has evolved from general-purpose flexibility (CPU) to extreme specialization (LPU). Each step trades some level of generality for efficiency.

The visual below maps the internal architecture of all five side by side.

👉 Over to you: Which of these 5 have you actually worked with or deployed on?
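The wave-like data flow described for the TPU's MAC grid can be sketched as a small cycle-by-cycle simulation. This is a toy model under stated assumptions, not Google's actual design: the function name `systolic_matmul` is hypothetical, each grid cell keeps its own accumulator (an output-stationary arrangement, one of several possible layouts), and the staggered arrival of operands is modeled by the index `t - i - j` instead of explicit pipeline registers.

```python
def systolic_matmul(a, b):
    """Simulate a grid of MAC units computing a @ b one cycle at a time.

    a is n x k and streams in from the left, b is k x m and streams in
    from the top. Cell (i, j) sees the operand pair with index s = t - i - j
    at cycle t, so work sweeps across the grid as a diagonal wave.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per MAC unit
    # The last operand pair reaches cell (n-1, m-1) at cycle n + m + k - 3.
    total_cycles = n + m + k - 2
    for t in range(total_cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    # Multiply-accumulate without a round trip to memory.
                    acc[i][j] += a[i][s] * b[s][j]
    return acc, total_cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)  # [[19, 22], [43, 50]]
print(cycles)  # 4
```

Each cell only ever touches its own accumulator and the operands passing through it, which is the property that lets partial results stay on the grid.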

865

242K

geraldlee193 retweeted

Nainsi Dwivedi

@NainsiDwiv50980

about 2 months ago

Nvidia trained a billion-parameter LLM without a single gradient, without backprop, without fp32 weights anywhere.

And it is 100x faster.

For the last decade, every major AI model has been trained the exact same way.

Backpropagation.

It requires massive, expensive GPUs. It requires complex floating-point math. It requires a massive memory footprint just to calculate the gradients.

It’s the reason why only mega-corporations can afford to train foundation models.

Until today.

Nvidia and Oxford published a paper called "Evolution Strategies at the Hyperscale."

They completely bypassed backpropagation.

Instead of calculating gradients, they use Evolution Strategies (ES), a method that randomly mutates the AI's parameters, sees what works best, and literally evolves the model.

In the past, this was way too computationally expensive for billion-parameter models.

But they fixed it by inventing EGGROLL (Evolution Guided General Optimisation via Low-rank Learning).

By compressing the mutations into low-rank matrices, they achieved a 100x increase in training speed for large models.

But that isn't the craziest part.

Because it doesn't use backpropagation, it doesn't need high-precision math.

They successfully trained a massive language model entirely on pure integer datatypes. Raw, basic, low-level math.

This completely rewrites the economics of open-source AI.

If you can train models directly on the cheap, fast integer datatypes they use for inference, the hardware requirements collapse.

You don't need a multi-million dollar cluster of high-end GPUs just to do the math anymore.
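The general idea of mutating a weight matrix with low-rank noise and keeping what scores well can be sketched in plain Python. This is not the paper's EGGROLL algorithm, only a generic evolution-strategies loop with rank-1 perturbations. Everything here is an illustrative assumption: the toy squared-error objective, the antithetic (plus and minus) sampling, the helper names, and the hyperparameters `pairs`, `sigma`, and `lr`.

```python
import random


def loss(w, target):
    """Squared error between two matrices stored as lists of rows."""
    return sum((wij - tij) ** 2
               for w_row, t_row in zip(w, target)
               for wij, tij in zip(w_row, t_row))


def shifted(w, a, b, scale):
    """Return w + scale * (a b^T) without mutating w."""
    return [[w[i][j] + scale * a[i] * b[j] for j in range(len(b))]
            for i in range(len(a))]


def es_step(w, target, rng, pairs=16, sigma=0.1, lr=0.05):
    """One ES update that uses only loss values, never a gradient."""
    rows, cols = len(w), len(w[0])
    update = [[0.0] * cols for _ in range(rows)]
    for _ in range(pairs):
        # A rank-1 mutation is stored as two vectors, not a full matrix.
        a = [rng.gauss(0.0, 1.0) for _ in range(rows)]
        b = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        # Score the same mutation in both directions.
        diff = (loss(shifted(w, a, b, sigma), target)
                - loss(shifted(w, a, b, -sigma), target))
        coeff = diff / (2.0 * sigma * pairs)
        for i in range(rows):
            for j in range(cols):
                update[i][j] += coeff * a[i] * b[j]
    # Move away from mutations that increased the loss.
    return [[w[i][j] - lr * update[i][j] for j in range(cols)]
            for i in range(rows)]


rng = random.Random(0)
target = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]
w = [[0.0] * 3 for _ in range(2)]
initial_loss = loss(w, target)
for _ in range(200):
    w = es_step(w, target, rng)
final_loss = loss(w, target)
print(initial_loss, final_loss)
```

Storing each mutation as two vectors is what keeps the memory cost per candidate small: for a matrix with `rows × cols` entries, a rank-1 mutation needs only `rows + cols` random numbers.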

212

151

14K

geraldlee193 retweeted

Phoenix Yin

@Phoenixyin13

about 2 months ago

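🚨 Stunning! A blockbuster paper in Science, April 2026: a Stanford team (with a Chinese postdoc as first author) has found that a protein can directly guide DNA synthesis. The central dogma, nearly 70 years old, has been completely rewritten!

Before: DNA → RNA → protein (information flows one way). Now the Stanford team has found that protein can also go in reverse → DNA! No DNA/RNA template is needed at all. The protein's own amino acid structure serves as the mold to precisely produce an AC-repeat DNA sequence. This is the biggest revision since Crick proposed the central dogma! For the first time, a protein has been shown to directly carry and transmit genetic information. Bacteria use it to fight viruses, and we humans may be about to enter a new era of genetic engineering.

The two things that delight me most:
1. The paper mentions using AlphaFold3 to help model the structure of Drt3b. This is routine practice in biological research, but I still feel the organic integration of AI and biology.
2. The mechanism of a protein directly serving as a template for making DNA opens up a new way for proteins to direct genetic information. If Drt3b can be engineered in the future so that it precisely synthesizes arbitrary DNA sequences according to AI-designed protein structures, then gene synthesis, gene editing, and synthetic biology may take a great leap forward, no longer relying entirely on traditional template-dependent polymerases. That would be a great future.

In short, in April 2026, humanity delivered a shock to the foundational theory of the life sciences. Please remember this moment, and remember this magical, wonderful April. #中心法则 #蛋白质DNA #Science #生物学突破 #斯坦福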