vvomen @vvomen181732 - Twitter Profile

about 8 hours ago

JUST IN: German public broadcaster begins “anti-AC campaign” warning of the dangers of air conditioning, as record heat wave hits the region.

1K

14K

2K

2M

vvomen181732 retweeted

Yuchen Jin

@Yuchenj_UW

3 days ago

You may have heard that GLM-5.2 at 328 tokens/s is cool. How about 392?

Databricks is now #1 in inference speed for GLM-5.2 on Artificial Analysis. It's a great model, and we did a lot of optimizations. https://t.co/S7MKjv1xnZ

90

1K

91

279

304K

vvomen181732 retweeted

alphaXiv

@askalphaxiv

4 days ago

A top-tier open RL recipe for terminal agents has just dropped.

As terminal agents become the main interface for coding models, this paper, TMAX, shares a reproducible recipe for training agents that lets a 9B model reach 27% on Terminal-Bench 2.0, which beats all prior open RL recipes and even 32B baselines.

By introducing better terminal practice, they generate 14.6k diverse Dockerized RL environments with controlled difficulty, domains, skills, personas, fixtures, and verifiers.

Then they train small open models with a simple outcome-only DPPO recipe that is more stable for long multi-turn terminal tasks.

The gains transfer beyond coding too: they found that the trained model interacts better with different harnesses, suggesting this recipe gives models strong general shell tool-use skills rather than memorizing one setup.

9

249

30

184

12K

vvomen181732 retweeted

Arnaud Bertrand

@RnaudBertrand

8 days ago

Eric Schmidt saying the quiet part out loud: "What I don't like about [China's AI] is that it's all open source which means it's largely uncontrolled and not controlled in any way by us." He adds, "if that makes you feel any better," that only 2 or 3 countries can be independent AI powers. In other words, it's all about hegemony: the ideal scenario is a world where AI is controlled by the US - and the fewer countries that can resist that, the better. Src for the video: https://t.co/Gk5iAMtBqa

510

10K

3K

2K

863K

vvomen181732 retweeted

Zarathustra

@zarathustra5150

8 days ago

Well this is awkward…

79

24K

4K

1K

503K

vvomen181732 retweeted

alphaXiv

@askalphaxiv

7 days ago

“Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers”

While Looped Transformers can spend more depth on harder problems, they still need a good way to know when to stop.

This paper makes the hidden state itself the stopping signal: it keeps looping until the hidden state converges to a fixed point.

With pre-norm, residual scaling, and damping, FPRM becomes stable at large depths, adapts compute to task difficulty, and beats similar 7M reasoning models on Sudoku, Maze, ARC-AGI-1, and state tracking.
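The stopping rule described in this post can be illustrated with a toy damped fixed-point loop. This is only a sketch under stated assumptions, not the paper's FPRM: `step` is a hypothetical stand-in for one pass through a shared block, and `damping`, `tol`, and `max_loops` are illustrative parameters that do not come from the source.

```python
import math

def step(h, x):
    # Hypothetical stand-in for one pass through a shared (looped) block.
    # The 0.5 weight keeps the map contractive, so a fixed point exists.
    return [math.tanh(0.5 * hi + xi) for hi, xi in zip(h, x)]

def fixed_point_loop(x, damping=0.5, tol=1e-6, max_loops=200):
    """Iterate until the hidden state stops changing; return (state, loops used)."""
    h = [0.0] * len(x)
    for n in range(1, max_loops + 1):
        proposal = step(h, x)
        # Damping: move only part of the way toward the proposed state.
        new_h = [(1 - damping) * hi + damping * pi for hi, pi in zip(h, proposal)]
        # The change in the hidden state is itself the stopping signal.
        delta = max(abs(a - b) for a, b in zip(new_h, h))
        h = new_h
        if delta < tol:
            return h, n
    return h, max_loops

if __name__ == "__main__":
    state, loops = fixed_point_loop([0.1, -0.3])
    print(f"converged after {loops} loops: {state}")
```

In this toy version the number of loops depends on the input and on `tol`, which is the sense in which compute adapts to the problem; the real model's update and convergence test may differ.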

6

367

57

261

32K

vvomen181732 retweeted

Guohao Li 🐫

@guohao_li

8 days ago

just in case you don’t know, the company behind GLM 5.2 is publicly listed, and its stock has gone up 15× in the past six months

72

3K

186

2K

619K

vvomen181732 retweeted

Justin Hudson

@RISignal

9 days ago

Knowledge is the geometry.

Reasoning is traversal through the geometry.

Intelligence is the acquisition and stabilization of effective trajectories through that geometry. https://t.co/zODgsshKmo

51

1K

185

997

46K

vvomen181732 retweeted

jietang

@jietang

10 days ago

@elonmusk @teortaxesTex won’t take that long

251

6K

464

552

2M

vvomen181732 retweeted

alphaXiv

@askalphaxiv

11 days ago

“Latent Thought Flow”

This paper moves reasoning into continuous latent space, but instead of learning one hidden thought path, it learns a distribution over many paths.

Using a continuous GFlowNet, Latent Thought Flow gives more probability to latent trajectories that are correct and cheap, so the model can stop early on easy problems and think longer on hard ones.

In their experiments, they were able to obtain better accuracy while reducing reasoning length.

7

332

55

208

13K

vvomen181732 retweeted

Jayden Teoh

@jayden_teoh_

12 days ago

Next-token prediction is myopic. What if transformers learn to predict their own next latent state?

🌠 We present 𝗡𝗲𝘅𝘁-𝗟𝗮𝘁𝗲𝗻𝘁 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 (𝗡𝗲𝘅𝘁𝗟𝗮𝘁): a self-supervised learning method that teaches transformers to form compact world models for reasoning and planning. It also unlocks up to 3.3x faster inference via self-speculative decoding! 🚀

47

2K

276

2K

279K

vvomen181732 retweeted

Z.ai

@Zai_org

12 days ago

Introducing GLM-5.2: Frontier Intelligence, Open Weights

- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1

Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb

685

13K

2K

5K

7M

vvomen @vvomen181732

13 days ago

@teortaxesTex @policytensor

0

3

0

101

vvomen181732 retweeted

elie

@eliebakouch

13 days ago

deepseek v1 -> v3 (no details in v4 about this) and k2 don't use muP and instead use naive N(0,0.006) initialization. so how do they do hyperparam selection?

they basically fit scaling laws to get optimal batch size and learning rate. there are a bunch of papers detailing this but i like these:
- deepseek llm: https://t.co/EufXkeBhZO (img 1)
- towards greater leverage from inclusion AI: https://t.co/iNxckKF6Ox (img 2)

there are a few issues with this approach. you basically never train with "optimal batch size" (the batch size that achieves the lowest loss in a fixed number of flops) but with "critical batch size" (the batch size that achieves the lowest loss in fixed wallclock gpu time, not the exact definition but good enough for intuition imo)

one solution is to fix the batch size and do scaling laws for learning rate only like poolside did (img 3), and another is to fix the batch size with hardware constraints and scale the learning rate accordingly. the usual rule is if you scale the optimal batch size by k, you scale the optimal learning rate by sqrt(k). there are regimes where this is more or less true, but the rule doesn't have to hold, depending on the optimizer you're using (there is a very nice blog series by @Jianlin_S about this)

so why not use muP?

still an open question imo. afaik there are only cohere and the falcon team that openly use muP in their training (maybe character ai as well?). the issue with muP is that you want to transfer hyperparams across multiple axes: depth/width/number of experts/token horizon, and the original muP only gives you width transfer. more advanced techniques give you some transfer along other axes (depth muP, mu-muP, u-muP etc.) but it's not clear if at scale this leads to better loss than SP. it also changes the stability and learning dynamics, should be better but since it's not really proven at scale it's hard to blindly trust. this also varies with architecture changes, for instance the falcon team made some changes to make muP work with mamba models, and i don't think attention residual and depth muP are compatible, see https://t.co/ahQDqakVEI
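The sqrt(k) rule mentioned in this thread can be written out as a small helper. This is a minimal sketch of the rule of thumb only: the function name and the example numbers are hypothetical, and as the thread notes, the rule may not hold for every optimizer or regime.

```python
import math

def scale_learning_rate(base_lr, base_batch, new_batch):
    """Square-root rule of thumb: batch size scaled by k -> learning rate scaled by sqrt(k)."""
    k = new_batch / base_batch
    return base_lr * math.sqrt(k)

if __name__ == "__main__":
    # Illustrative numbers only: 4x the batch size gives 2x the learning rate.
    print(scale_learning_rate(3e-4, 256, 1024))
```

A helper like this only gives a starting point when the batch size is fixed by hardware constraints; the learning rate would still need to be validated empirically.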

11

297

24

357

68K

vvomen181732 retweeted

Mustafa

@oprydai

13 days ago

disappeared like never existed.

34

1K

24

242

174K

vvomen181732 retweeted

Quan Nguyen

@stablequan

13 days ago

aint no way they are doing it fr 😭😭😭

23

791

16

78

101K

vvomen181732 retweeted

Arthur Mensch

@arthurmensch

13 days ago

It's actually le gros chaton

429

9K

811

420

2M

vvomen181732 retweeted

Peter Gostev

@petergostev

13 days ago

Le Chaton Fat: We have requested an urgent authorisation from the French government to extend the axis to display the result on the Agent Arena https://t.co/I3J8P82634

48

1K

67

219

194K

vvomen181732 retweeted

Guillaume Lample @ NeurIPS 2024

@GuillaumeLample

13 days ago

71

2K

143

159

356K

vvomen181732 retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

13 days ago

This video by @jbhuang0604 is a compact but very informative dive into the progress of self-supervised learning over the past few decades.

from IMAX in 1992

covering methods like MoCo, SimCLR, DINO, BYOL, MAE

all the way up to LeJEPA in 2025

Highly recommend watching! https://t.co/4EkdfHHxWV

4

230

32

206

12K
