ShadmanRohan @RohanShadman - Twitter Profile

about 1 month ago

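At Stanford CS336, Tatsu gave a lecture on LLM architecture, taking apart every mainstream LLM of the past 3 years to find their common template.

The conclusion is striking: 90% of architecture choices have converged. Pick any open-source large model and it looks almost identical to the others on these dimensions.

In the lecturer's own words:
- 2024: everyone was cosplaying Llama 2
- 2025: the theme was "how to train without blowing up"
- 2026: the theme is "how to handle long context"

Below is the standard template for open-source LLMs in 2026. If you are training your own model, you can copy it directly.

【7 architecture choices that have converged】

1) Move LayerNorm out of the residual stream (pre-norm)
The original Transformer put LN inside the residual stream. Almost every modern model moves it outside. The reason: keep your residual stream clean, which makes gradient backpropagation more stable.

2) RMSNorm replaces LayerNorm
The mean subtraction and bias in LayerNorm turn out not to help much. Normalization accounts for only about 0.17% of FLOPs but roughly 25% of runtime, so the simpler operation pays off in wall-clock time (the bottleneck is data movement; compute is secondary).

3) Drop all bias terms
Same logic as RMSNorm: less memory traffic at the systems level.

4) Use SwiGLU or GeGLU as the activation
Gated linear units are used by almost every modern model. The Llama family, Qwen, and Mistral use SwiGLU; Google models (Gemma, T5) use GeGLU. The difference is tiny, so either works.

5) Use RoPE for position encoding
Essentially unified since 2024. The idea: rotate each pair of dimensions by an angle that depends on position, so the inner product depends only on relative position.

6) Transformer blocks in series (not parallel)
GPT-J and PaLM tried parallel blocks; the approach has mostly been abandoned. Serial implementations are so well optimized that the small systems savings from parallel blocks are not worth the loss in expressiveness.

7) Layer norm can be "sprinkled"
Wherever training is unstable, add an LN there. You can add it before attention, after attention, or on both sides (double norm). Many modern models do this.

【5 hyperparameters that have converged】

1) Feedforward dimension / hidden dimension
- Non-GLU models: 4x
- GLU models: 8/3 ≈ 2.67x (GLU adds an extra matrix, so the width shrinks to keep the total parameter count the same)
- Llama family: 3.5x
- T5 1.0 tried 64x, and T5 1.1 later went back to a standard ratio. Don't copy it.

2) Number of heads × head dimension ≈ hidden dimension
Almost every model follows this. T5 is one of the few exceptions.

3) Model aspect ratio (hidden / number of layers) ≈ 100
Too deep and pipeline parallelism gets hard; too wide and expressiveness suffers. 100 is the balance point between systems constraints and expressiveness.

4) Vocab size
Monolingual models: around 30K (the early GPT-2 kind). Multilingual / general models: 100K-200K (GPT-4, Llama 3, and Gemma are all in this range). Modern models are basically all the latter.

5) Weight decay
Still widely used, but research has found that what it does in LLMs is act as an optimizer intervention that lets you converge to a deeper optimum in the end. It has little to do with the "prevent overfitting" story you might expect. So don't turn it off just because "a single epoch can't overfit".

【Three stability tricks that save training runs】

The worst fear when training a large model is that the loss suddenly spikes mid-run, goes to NaN, and the whole run is lost. Modern models use three tricks to prevent this.

1) Z-loss
The normalizer of the output softmax tends to blow up. Add a (log Z)² regularization term so that Z stays close to 1. DCLM and Olmo both use it.

2) QK norm
Apply an LN to each of Q and K in attention before the matrix multiply, so the softmax input always stays at unit scale. The multimodal community used it first; now all large models add it.

3) Logit soft cap (Google models only)
Cap the attention logits with tanh. Gemma 2/3/4 all use it, but it costs a little performance, so use it with care.

【Two new trends in attention】

1) GQA (Grouped Query Attention) is almost universal
With original multi-head attention, the KV cache at inference time makes arithmetic intensity collapse to 1/h. GQA shares K and V but keeps multiple Q heads. Expressiveness is almost unaffected and inference cost drops by 80%. Every large model intended for production deployment now uses GQA.

2) Alternating local + global attention
A new way to handle long context. Cohere Command A started it, and Llama 4, Gemma 4, and Olmo 3 all use it now. For example, 1 in every 4 layers is full attention and the other 3 are sliding window layers that only look at nearby tokens. This is more stable than a pure SSM and much cheaper than pure full attention. (Qwen 3.5 has a variant that replaces those 3 sliding window layers with SSM layers.)

One closing note
If you are training your own LLM, the set above is the "default configuration" for 2026. No need to reinvent it; just copy it. If you only want to understand the modeling_xxx.py files on GitHub, this is enough to stop the jargon from scaring you.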
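A few of the items above are easier to see in code. What follows is a minimal sketch in plain Python (standard library only), not the lecture's or any model's actual implementation. The function names, `eps=1e-6`, the RoPE `base=10000.0`, and the z-loss coefficient `1e-4` are all illustrative assumptions rather than values from the post.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square. No mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to pos.

    Assumes len(x) is even. `base` is an assumed default for this sketch.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2, with Z the softmax normalizer."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


def glu_ffn_dim(d_model):
    """FFN width at which a 3-matrix GLU block matches a 2-matrix 4x block."""
    return round(8 * d_model / 3)


# RoPE: the attention score depends only on the relative offset (3 here),
# not on the absolute positions.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

# RMSNorm with unit gain: the output has root mean square close to 1.
normed = rms_norm([3.0, 4.0], [1.0, 1.0])
normed_rms = math.sqrt(sum(v * v for v in normed) / len(normed))
```

The parameter-matching argument behind 8/3 is that a non-GLU feedforward block has two d × 4d matrices (8d² parameters), while a GLU block has three d × d_ff matrices, so 3 · d · d_ff = 8d² gives d_ff = 8d/3. Real models usually round this up to a hardware-friendly multiple, which this sketch does not do.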

29

3K

589

5K

533K

RohanShadman retweeted

Roan

@RohOnChain

about 2 months ago

This 2-hour Stanford lecture shows exactly how Stanford trains its engineers to build AI systems. It's more practical than every Claude tutorial and prompting thread you've seen. Bookmark it and give it 2 hours, no matter what. It'll be the most productive thing you do this weekend.

159

14K

2K

52K

2M

RohanShadman retweeted

François Chollet

@fchollet

3 months ago

AI agents will soon graduate to fully-fledged economic actors that buy services, compute, and even data in the course of accomplishing high-level goals. 1-2 years before we start seeing this at scale.

195

2K

184

447

263K

RohanShadman retweeted

Boris Cherny

@bcherny

5 months ago

I'm Boris and I created Claude Code. Lots of people have asked how I use Claude Code, so I wanted to show off my setup a bit. My setup might be surprisingly vanilla! Claude Code works great out of the box, so I personally don't customize it much. There is no one correct way to use Claude Code: we intentionally build it in a way that you can use it, customize it, and hack it however you like. Each person on the Claude Code team uses it very differently. So, here goes.

1K

55K

7K

104K

8M

ShadmanRohan @RohanShadman

5 months ago

2019 vs. Today. We’ve come a long way. Back then, the “gotcha” was: ask it a simple arithmetic word problem and it collapses. Today, Fields Medalists are using these models to turn research math into machine-checkable proofs.

0

11

RohanShadman retweeted

Akshay 🚀

@akshay_pachaar

6 months ago

When outputs are verifiable, labels become optional. Maths, code, and logic can be automatically checked and validated. Let's use this fact to build a reasoning model without manual labelling. We'll use: - @UnslothAI for parameter-efficient finetuning. - @HuggingFace TRL to apply GRPO. Let's go! 🚀

3

78

9

139

48K

RohanShadman retweeted

Paata Ivanisvili

@PI010101

8 months ago

GPT-5 Pro found a counterexample to the NICD-with-erasures majority optimality (Simons list, p.25). https://t.co/T3m9MYgqe0 At p=0.4, n=5, f(x) = sign(x_1-3x_2+x_3-x_4+3x_5) gives E|f(x)|=0.43024 vs best majority 0.42904.

55

1K

211

548

773K

ShadmanRohan @RohanShadman

8 months ago

Current LLMs are hitting the ceiling on “more tokens = better thinking.” A promising direction is procedural memory over ever-longer chains of thought—capturing recurring reasoning as reusable behaviors. Think smarter, not just longer. #AI #LLM #Reasoning #Efficiency #MLOps

0

16

ShadmanRohan @RohanShadman

9 months ago

New from Google Research✨: Learn Your Way🎒 Upload a 📚textbook/PDF → Interactive Lessons 🧭Mind maps ⚡Quizzes 🎧Audio lessons 📊11 percentage points better retention (78% vs. 67%) than a digital reader. 𝗧𝗵𝗼𝘂𝗴𝗵𝘁𝗳𝘂𝗹 𝗱𝗲𝘀𝗶𝗴𝗻 > 𝗺𝗼𝗿𝗲 𝘀𝗰𝗿𝗲𝗲𝗻 𝘁𝗶𝗺𝗲. #AI #education #google

0

27

ShadmanRohan @RohanShadman

10 months ago

🎉 Just had an incredible experience attending The 63rd Annual Meeting of the Association for Computational Linguistics! 🎉 - via #Whova event app

0

1

0

48

ShadmanRohan @RohanShadman

over 1 year ago

@godofprompt Oh no,, you ss so,

0

11

RohanShadman retweeted

Matthew Berman

@MatthewBerman

over 1 year ago

1/ Google Research unveils new paper: "Titans: Learning to Memorize at Test Time" It introduces human-like memory structures to overcome the limits of Transformers, with one "SURPRISING" feature. Here's why this is huge for AI. 🧵👇

58

3K

431

3K

433K

RohanShadman retweeted

Aaron Mueller @amuuueller

over 1 year ago

What can mechanistic interpretability do for computational psycholinguists? @michaelwhanna and I took a stab at this question! We investigate garden path sentence processing in LMs at the feature (circuit) level.

2

59

10

6K

RohanShadman retweeted

Wei Xu

@cocoweixu

over 1 year ago

We wrapped up CS 8803 "Large Language Model" class at @GeorgiaTech for Fall 2024. Here is the reading list: • learning from human preferences (PPO, DPO, SimPO, CPO, RRHF, ORPO, CTO) • real-world LLM (Llama-3, Aya, Arena's) • efficient LLM (MoMa, LoRA, QLoRA, LESS)

14

1K

166

1K

96K

ShadmanRohan @RohanShadman

almost 2 years ago

@MAarafat71 Star Jalsha has started 😂

0

1

0

205

RohanShadman retweeted

Arpit Adlakha @arpit20adlakha

almost 2 years ago

One of the finest roadmaps I have seen for Senior Software Interviews, a guy posted on LeetCode for clearing Uber L5A, L5B or Google L5/L6 levels.

68

7K

526

19K

2M

RohanShadman retweeted

Tim Denning

@Tim_Denning

about 2 years ago

I’ve spent over 120 hours studying one of the most controversial authors. Nassim Taleb. Here are 11 of his best lessons ↓

115

3K

576

6K

1M

RohanShadman retweeted

Ole Lehmann

@itsolelehmann

about 2 years ago

I'm 32. After living my whole life in Germany, last year I took the leap and moved abroad to Cyprus. It's the greatest lifestyle upgrade I've ever experienced. 20 lessons for living the good life abroad (that'll make your move easier):

25

409

22

470

260K

ShadmanRohan @RohanShadman

about 2 years ago

@rose_e_wang Will the workshop submissions be included in the conference proceedings, or are they considered non-archival?

1

0

32

RohanShadman retweeted

Toan Truong

@ToanTruong_

over 2 years ago

I'm 18. I’m obsessed with learning how to learn. So, I spent 200+ hours studying how geniuses, prodigies, and high performers master their disciplines. Here's what I found on how to master anything faster:

441

31K

6K

52K

6M
