You may have heard that GLM-5.2 at 328 tokens/s is cool,
How about 392?
Databricks is now #1 in inference speed for GLM-5.2 on Artificial Analysis. It's a great model, and we did a lot of optimizations.
A top-tier open RL recipe for terminal agents has just dropped.
As terminal agents are becoming the main interface for coding models, this paper, TMAX, shares a reproducible recipe for training agents that lets a 9B model reach 27% on Terminal-Bench 2.0, which beats all prior open RL recipes and even 32B baselines.
By introducing better terminal practice, they generate 14.6k diverse Dockerized RL environments with controlled difficulty, domains, skills, personas, fixtures, and verifiers.
Then they train small open models with a simple outcome-only DPPO recipe that is more stable for long multi-turn terminal tasks.
The gains transfer beyond coding too: they found that the trained model interacts better with different harnesses, suggesting this recipe gives models strong general shell tool-use skills rather than teaching them to memorize one setup.
Eric Schmidt saying the quiet part out loud: "What I don't like about [China's AI] is that it's all open source which means it's largely uncontrolled and not controlled in any way by us."
He adds, "if that makes you feel any better," that only 2 or 3 countries can be independent AI powers.
In other words, it's all about hegemony: the ideal scenario is a world where AI is controlled by the US - and the fewer countries that can resist that, the better.
Src for the video: https://t.co/Gk5iAMtBqa
“Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers”
While Looped Transformers can spend more depth on harder problems, they still need a good way to know when to stop.
This paper makes the hidden state itself the stopping signal: the model keeps looping until the hidden state converges to a fixed point.
With pre-norm, residual scaling, and damping, FPRM becomes stable at large depths, adapts compute to task difficulty, and beats similar 7M reasoning models on Sudoku, Maze, ARC-AGI-1, and state tracking.
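The "loop until the hidden state converges" idea can be sketched as a damped fixed-point iteration. This is only a toy illustration, not the paper's model: `step` is a hypothetical stand-in for one pass through the looped block, and the damping factor, tolerance, and loop cap are assumed values chosen for the example.

```python
import math

def step(h, x):
    """Hypothetical stand-in for one looped block (assumed, not from the paper).
    It is a contraction in h, so a fixed point exists."""
    return [math.tanh(0.5 * hi + xi) for hi, xi in zip(h, x)]

def run_to_fixed_point(x, damping=0.5, tol=1e-6, max_loops=1000):
    """Iterate until the state stops changing; return (state, loops_used)."""
    h = [0.0] * len(x)
    for n in range(1, max_loops + 1):
        proposal = step(h, x)
        # Damped update: move only part of the way toward the proposal.
        new_h = [(1 - damping) * hi + damping * pi for hi, pi in zip(h, proposal)]
        # The size of the change in the state is the stopping signal.
        delta = max(abs(a - b) for a, b in zip(new_h, h))
        h = new_h
        if delta < tol:
            return h, n
    return h, max_loops

h_star, loops = run_to_fixed_point([0.1, -0.3, 0.7])
```

The number of loops used is not fixed in advance: inputs whose state settles quickly exit early, which is the sense in which compute adapts to the problem.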
Knowledge is the geometry.
Reasoning is traversal through the geometry.
Intelligence is the acquisition and stabilization of effective trajectories through that geometry.
“Latent Thought Flow”
This paper moves reasoning into continuous latent space, but instead of learning one hidden thought path, it learns a distribution over many paths.
Using a continuous GFlowNet, Latent Thought Flow gives more probability to latent trajectories that are correct and cheap, so the model can stop early on easy problems and think longer on hard ones.
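A GFlowNet is trained so that it samples an object with probability proportional to its reward. A toy sketch of what that target distribution looks like for "correct and cheap" trajectories follows. The reward form (a correctness term times an exponential length penalty), the `cost` value, and the three example trajectories are all assumptions for illustration, not the paper's definitions.

```python
import math

def reward(correct, length, cost=0.1):
    """Hypothetical reward: high for correct answers, decaying with length."""
    base = 1.0 if correct else 1e-3  # small floor so wrong paths keep nonzero mass
    return base * math.exp(-cost * length)

def target_distribution(trajectories, cost=0.1):
    """Normalize rewards so that P(trajectory) is proportional to its reward."""
    rewards = [reward(c, n, cost) for c, n in trajectories]
    total = sum(rewards)
    return [r / total for r in rewards]

# (is_correct, number_of_latent_steps) for three made-up trajectories
trajs = [(True, 4), (True, 12), (False, 2)]
probs = target_distribution(trajs)
```

Under these assumptions the short correct path gets the most mass, the long correct path gets less, and the wrong path gets almost none, even though it is the shortest.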
In their experiments, they were able to obtain better accuracy while reducing reasoning length.
Next-token prediction is myopic. What if transformers learn to predict their own next latent state?
🌠 We present 𝗡𝗲𝘅𝘁-𝗟𝗮𝘁𝗲𝗻𝘁 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 (𝗡𝗲𝘅𝘁𝗟𝗮𝘁): a self-supervised learning method that teaches transformers to form compact world models for reasoning and planning. It also unlocks up to 3.3x faster inference via self-speculative decoding! 🚀
Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb
deepseek v1 -> v3 (no details in v4 about this) and k2 don't use muP and instead use naive N(0,0.006) initialization. so how do they do hyperparam selection?
they basically fit scaling laws to get optimal batch size and learning rate. there are a bunch of papers detailing this but i like these:
- deepseek llm: https://t.co/EufXkeBhZO (img 1)
- towards greater leverage from inclusion AI: https://t.co/iNxckKF6Ox (img 2)
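As a rough sketch of what "fitting scaling laws" for these hyperparameters means: run small sweeps, record the best learning rate and batch size at each compute budget, fit a power law in log-log space, and extrapolate to the target budget. The data below is synthetic and the coefficients are made up for the example; they are not the values from either paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b

def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b

# Synthetic "best hyperparameter per compute budget" sweep results (assumed).
compute = [1e17, 1e18, 1e19, 1e20]          # FLOPs
best_lr = [0.5 * c ** -0.12 for c in compute]
best_bs = [0.3 * c ** 0.33 for c in compute]

lr_a, lr_b = fit_power_law(compute, best_lr)
bs_a, bs_b = fit_power_law(compute, best_bs)

# Extrapolate to a larger budget than any in the sweep.
target_lr = predict(lr_a, lr_b, 1e22)
target_bs = predict(bs_a, bs_b, 1e22)
```

Real sweep data is noisy and the optimum is usually a flat basin rather than a point, so the fitted line is a guide for the large run, not an exact answer.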
there are a few issues with this approach. you basically never train with "optimal batch size" (the batch size that achieves the lowest loss in a fixed number of flops) but with "critical batch size" (the batch size that achieves the lowest loss in fixed wallclock gpu time, not the exact definition but good enough for intuition imo)
one solution is to fix the batch size and do scaling laws for learning rate only like poolside did (img 3), and another is to fix the batch size with hardware constraints and scale the learning rate proportionally. the usual rule is if you scale the optimal batch size by k, you scale the optimal learning rate by sqrt(k). there are regimes where this is more or less true, and this rule doesn't have to hold depending on the optimizer you're using (there is a very nice blog series by @Jianlin_S about this)
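the sqrt rule above as a tiny helper. the function name and the example numbers are made up; the `exponent` parameter is there because, as noted, the right value depends on the regime and the optimizer (0.5 is the sqrt rule, 1.0 would be linear scaling).

```python
import math

def scale_lr(base_lr, base_batch, new_batch, exponent=0.5):
    """Rescale a tuned learning rate when the batch size changes by k.
    exponent=0.5 gives the sqrt(k) rule; other values are regime-dependent."""
    k = new_batch / base_batch
    return base_lr * k ** exponent

# lr tuned at batch 256, hardware forces batch 1024 (k = 4, so lr doubles)
new_lr = scale_lr(3e-4, 256, 1024)
```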
so why not use muP?
still an open question imo. afaik only cohere and the falcon team openly use muP in their training (maybe character ai as well?). the issue with muP is that you want to transfer hyperparams across multiple axes (depth/width/number of experts/token horizon), and the original muP only gives you width transfer. more advanced techniques give you some transfer along other axes (depth muP, mu-muP, u-muP etc.) but it's not clear if at scale this leads to better loss than SP (standard parametrization). it also changes the stability and learning dynamics; these should be better, but since it's not really proven at scale it's hard to blindly trust. this also varies with architecture changes: for instance the falcon team made some changes to make muP work with mamba models, and i don't think attention residual and depth muP are compatible, see https://t.co/ahQDqakVEI
This video by @jbhuang0604 is a compact but very informative dive into the progress of self-supervised learning over the past few decades.
from IMAX in 1992
covering methods like MoCo, SimCLR, DINO, BYOL, MAE
all the way up to LeJEPA in 2025
Highly recommend watching!