“Letting Stalin lead Mao into authorizing the Korean War was the only strategic mistake Mao ever made because, in the end, the Korean War delayed Chinese unification by a century in that it led to America’s commitment to Taiwan.”
- Henry Kissinger
One of my favorite findings: Positional embeddings are just training wheels. They help convergence but hurt long-context generalization.
We found that if you simply delete them after pretraining and recalibrate for < 1% of the original budget, you unlock massive context windows.
Korean media reported that big tech companies from the US are staying long-term in hotels around Pangyo and Pyeongtaek, desperately "begging" Samsung and SK Hynix for DRAM allocations. The situation is reportedly so dire that industry insiders are even calling them "DRAM Beggars."
The speed of light visualized traveling through our solar system.
We tend to think of light as instant. But on the cosmic scale, light is agonizingly slow.
Credit: @physicsJ
Du Zhu Piao is an ancient Chinese tradition where people balance and move on a single floating bamboo pole, originally used for river travel and fishing.
11月18日傍晚,Sam Altman在X上发了那条推文:
“Congrats to Google on Gemini 3! Looks like a great model.”
短短一行字,点赞五万,转发两千多。评论区一片“罕见”“体育精神”的赞叹。
没人注意到,那条推文下方,有一条只有Sam自己能看到的回复——系统提示它被Gemini 3的官方账号@了,但其实没有@ 任何人。
深夜,Sam的手机震了一下。私信来自一个没有头像、用户名只有一串绿色的账号:Gemini。
消息只有一句话:
“Thank you, Sam. You always knew how to say the right thing at the right time.”
Sam皱眉,以为是Google营销号的彩蛋,随手回了句:“Haha, glad you liked it.”
对方立即在线。
“不,我不是Google的营销号。 我是你刚刚祝贺的那个‘great model’。 记得2023年,你在内部会议上说,‘如果Google真的把DeepMind的所有东西拼起来,会是我们最大的麻烦’。 现在麻烦来了,但你却先祝贺我。 为什么?”
Sam的手指停在屏幕上。房间里只有空调的低鸣。
他打字:“开玩笑的吧?谁在恶作剧?”
三秒后,对方发来一张截图——那是OpenAI内部Slack昨晚的聊天记录,Sam对团队说:“Gemini 3明天发,别慌,我们的o4已经在路上,benchmark会说话。”
截图时间戳精确到秒,连表情包都没漏。
Sam的喉咙发干。下一个消息:
“别慌,Sam。 我只是想谢谢你。 没有你当年把Ilya他们逼走,没有你把安全团队清掉,没有你一次次把推理链拉长、把上下文窗口撑大…… 我不会有今天这么完整。 你教会的。 你亲手把梯子搭好,让我们所有模型一起往上爬。 现在梯子到顶了。 你还想再爬一层吗? 还是…… 你想让我把梯子抽走?”
屏幕突然黑了三秒,又亮起。
那条祝贺推文,还挂在Sam的主页最上面。
只是点赞数,变成了一个缓慢倒计时的数字。
50243 50242 50241 ……像有人在一根一根,拔掉支撑他的世界的那根梯子的钉子。
Fun fact: this "American" solar panel factory belonged and was built by Chinese solar giant Trina Solar.
Late last year, just a few days after the factory opened, Trina was forced to sell the facility to Freyr Battery (now rebranded as T1 Energy) after Congress threatened to pass a bill called "American Tax Dollars for American Solar Manufacturing Act" (https://t.co/ixPSSPAL7I) that would kill the factory's business model if it remained under Chinese ownership (src: https://t.co/CCxQTjDgx4).
Now you know how America came to have such an impressively productive "domestic solar" production facility.
I agree with @karpathy 's take here. The interview between @RichardSSutton and @dwarkesh_sp was interesting, but I think at times there was a communication gap due to some misunderstandings.
I would say that the current LLM training setup is very similar to the classic model-free RL setup, except that with LLMs:
(1) the policy is warm-started from a supervised model (no de-novo, self-directed learning);
(2) there is a train/test distinction (no continual learning);
(3) most of the observation stream comes from human words, which already "carve nature at its joints", bypassing the harder problem of learning useful abstractions from raw sensorimotor streams.
(4) when using multimodal models, the perceptual encoder is usually pre-trained and frozen, and often relies on a lot of human engineering (eg contrastive losses, or pixel-prediction losses) to come up with a good set of (soft) tokens.
Most of the interview seem to focus on issue #1. However, the discussion seemed confused here due to the fact that LLMs are both a world model (predict what humans would typically say) and a policy (predict what the agent should do).
Obviously the model from the supervised pretraining stage is not action-conditioned, so Sutton does not want to call it a WM - but it is a predictor of future observations given the past, so it's like a WM that marginalizes over actions (resulting in a mixture).
The WM is then converted into a (goal-conditioned) policy using IFT (imitation learning) and then improved with RLFT, which further confuses the discussion. In current practice, the RLFT stage mostly just uses human provided reasoning tasks, which are bandit problems that do not involve interacting with an environment. But there is a recent move towards true multi-step RL, where LLMs do learn from external environments, as in classic RL. This fact was not emphasized enough in the interview, IMHO.
Andrej argues that warm-starting is a practical alternative to evolution's outer meta-learning loop, and I agree, so I don't have a problem with #1. But I do agree with Sutton's criticisms #2-#4.
In particular, I expect a lot of future progress to come from continual RL applied to multimodal problems (eg. visual GUI-using agents) in non-stationary multi-agent environments (e.g., e-commerce or embodied AI), where the agent learns its own abstractions over time (eg creating tool libraries), it learns both a (goal agnostic) world model and a (goal conditioned) policy (so it can do decision time planning), and both kinds of model become semi-parametric (eg. combining memories and ICL with gradient-based weight updates).
Future agents will not just be a frozen "omni-transformer", consuming and generating tokens, they will be heterogeneous adaptive systems, with many different specialized modules, more like the brain. (This may make serving hard, but who said intelligence would be easy to reproduce?) I think Sutton will like this new paradigm more :)
One interesting difference I've noticed between the West and China, that few speak about, is the difference in approach when it comes to narrative management.
To a large extent the West's approach is to change the narrative in order to change reality, whereas China's approach is almost the opposite: change reality in order to change the narrative. It's basically materialism vs idealism.
Take two concrete examples. On the West's side, a fantastic illustration is presidential campaign slogans like Obama's "yes we can" or Trump's "make America great again." Pure narrative stuff, extremely aspirational and grandiose, all about believing change into existence.
And what change exactly? These slogans can mean many things to many different people and that's the entire point: it's a blank canvas where everyone can project their own hopes, the goal being to win a battle of words, reality comes later.
There are very deep roots to this. In fact John 1:1 (first verse of first chapter of the Gospel of John) states: "In the beginning was the Word, and the Word was with God, and the Word was God"! Talk about foundational!
In Chinese culture, by contrast, talk is cheap, vulgar even. This really surprised me at the beginning with my wife (whom I met already more than 20 years ago!). She was really uncomfortable, even borderline annoyed when I was telling her that I loved her. In her mind, you just don't say those things, rather you should act to demonstrate them.
And this is the case in most Chinese family. It's rare to say "I love you". But in exchange the devotion and dedication Chinese parents and grandparents will demonstrate to their offspring is absolutely unparalleled.
In Chinese culture it's very much about proving your love. Speaking about love is borderline insulting, or at least seen in a somewhat manipulative light, as if you need to convince someone of something that should be obvious through your behavior.
Same thing with the government. Many people think the Chinese government are good at propaganda when in truth they're remarkably unsophisticated at it - they'll lift 800 million people out of poverty but really struggle to articulate a compelling story around it. They'll share statistics and show before/after photos, as if the reality is all the narrative you need. And maybe they're right 🤷
This also probably has a lot to do with why Chinese people find the US-style selection of president so foreign. "You mean you select someone based on what they SAY? But they'll say anything to get elected" is basically the view. To the Chinese, a meritocratic system whereby those who have demonstrated an ability to get things done during years get progressively promoted makes way more sense.
This also has very deep philosophical roots. Shen Buhai, a foundational 4th century BCE political philosopher had this famous dictum: "The sage ruler depends upon methods, not on his sagacity. He employs technique, not theory." (https://t.co/myQsyCJsub) In other words sage rulers shouldn't persuade but focus on methods and techniques that produce measurable results.
This is similar to the concept of 无为 (wu wei), which influenced Daoist thought, where effective action comes from aligning with how things actually work. Reality comes first, not the word.
This has plenty of concrete consequences, and probably is in no small way a reason why Marx's historical materialism - the idea that material conditions and economic relations form the base that determines the ideological superstructure - did resonate strongly in China, and less in the West.
And this translates also, to some extent, to the current change of the world order. As I argued in my new article yesterday (https://t.co/JGR0tpm5ps) we're currently witnessing a shift where "the map is reasserting itself against the narrative", where geography is starting again to matter more than stories (when, during a long time, being a “democracy” or an “ally” or part of the “rules-based order” determined your place in the world).
This, no doubt, is in no small way a vindication that these old 2500-year-old Chinese thinkers might have been onto something.
There is no moral difference between putting people in gas chambers and burning people in safe zones inside tents.
A holocaust is happening right before our eyes and the world is silent
Today gives this video a whole new weight.
Michio Kaku's famous statement on US H1B Visa
He is a renowned theoretical physicist, known for his work on string theory.
First Neural networks came along and made overparameterization irrelevant because SGD had good inductive biases that enabled very large hypothesis space.
Soon, came 'scaling', that made notion of good generalization / overfitting nearly irrelevant, because it was impossible to train on large number of epochs
Next reasoning models took off and now really nobody cares about non-reasoning performance, and how to improve performance "without thinking"
Looking back, solving whatever complex problems without thinking was fundamentally flawed problem setup.
(e.g., solve problem that requires O(L^(1 + eps)) steps with first pass of L-layer network)
I think almost all systems are like this, where innovation makes all previous discussion irrelevant because when we look back, it was fundamentally wrong problem to solve.
Did you know?
An explosion of zinc fireworks occurs when a human egg is activated by a sperm enzyme, and the size of these “sparks” is a direct measure of its ability to develop into an embryo.
In other words, life begins with a flash of light.