I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced.
I fear something terrible has happened.
A super long overdue (3+ years?) post on scaling laws.
Compute is expensive. Scaling laws are a way to help us reason about the optimal compute allocation between data and model size before committing to a large run.
The post covers what scaling laws predict, how compute-optimal allocation works, why Kaplan et al. and Chinchilla disagree, and how data limits + fitting details make extrapolation tricky.
https://t.co/HP26eJvjHB
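To make the allocation question concrete, here is a minimal sketch. It assumes the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens) and a fixed tokens-per-parameter ratio; the default of 20 is the rule of thumb usually attributed to Chinchilla. The function name and defaults are illustrative assumptions, not taken from the linked post.

```python
import math

def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens).

    Assumes C ~= 6 * N * D and a fixed ratio D = tokens_per_param * N.
    Both are approximations, not fitted scaling-law coefficients.
    """
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(1e21)
print(f"params ~ {n:.3e}, tokens ~ {d:.3e}")
```

Changing `tokens_per_param` is a crude way to see how much the answer moves when the assumed optimum shifts, which is the kind of disagreement the post attributes to Kaplan et al. vs. Chinchilla.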
Anthropic’s fear campaign around Mythos has almost single-handedly slowed the normal release of GPT-5.6, while also making government approval of frontier model access the new normal for US AI labs.
It’s not hard to foresee where this will inevitably lead:
1. Frontier models will ship more slowly. The days when the industry was shipping new models every month are over.
2. Frontier labs will be compelled to build “will the government permit release” into their training process as a binding constraint.
3. A caste-like pattern of access will take hold across the entire industry.
This is precisely why fear-based marketing and geopolitical posturing in the tech sector have always been a dangerous game to play.
In their OCR 4 launch this week, Mistral shared a significantly lower score for Chandra 2 than you get from our repo or by running our public code.
They also omitted Infinity Parser, which reports 87.6%, from their olmocr comparison.
If you need more reasons to tell your friends why to buy a Tesla, JD Power has a few. Tesla was 'unofficially' ranked 3rd in IQS (initial quality) and 1st in EVX (EV experience) https://t.co/H4s5ewHepf
Apple’s plans are fluid given the component supply chain right now, but it aims to launch the M6 this year, the M7 by the middle of next year, the M7 Pro and M7 Max in late 2027 and the M7 Ultra in 2028.
I got DeepSeek-V4-Flash MTP speculative decoding actually working on 2× RTX PRO 6000
+38% single-stream throughput.
It was declared “broken on SM120”
The kernels weren’t the problem. It was one mis-routed quantization format in the loader
←on 45tok/s off 98tok/s→
Why Would GLM-5.2 Move Away From GRPO?
🌟Insights from Zhihu contributor 九老师
TL;DR: GLM-5.2 dropping GRPO does not mean GRPO is “bad.” It means the assumptions that made GRPO attractive for short LLM RL tasks may no longer hold for long-horizon agentic tasks. When rollouts get longer, environments get noisier, and credit assignment gets harder, PPO + value modeling starts looking useful again.
The key question is not simply “why did GLM-5.2 stop using GRPO?” A better question is: why did GRPO become useful for LLM RL in the first place?
If the reasons that made GRPO attractive no longer hold, then going back to PPO becomes natural.
GRPO can be understood as a sampled-baseline method. Instead of training a separate value model, it samples multiple responses for the same prompt and uses the group average as a baseline.
That is elegant. You get a relative reward signal without paying for a separate critic. In short tasks, this is very appealing.
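A minimal sketch of that sampled baseline, assuming scalar rewards for several responses to one prompt. Dividing by the group standard deviation is a common GRPO convention but is an assumption here, so it is optional; the function name is hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, normalize=True, eps=1e-6):
    """Advantage of each sampled response relative to its own group.

    The baseline is the group mean, so no separate value model is needed.
    normalize=True also divides by the group std (an assumed convention).
    """
    baseline = mean(rewards)
    scale = (pstdev(rewards) + eps) if normalize else 1.0
    return [(r - baseline) / scale for r in rewards]

# Four responses to the same prompt, scored by a verifier (1 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct responses come out with positive advantages of roughly +1 and incorrect ones roughly -1, using nothing but the group's own rewards.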
But there is a tradeoff.⚖️
PPO uses a learned value function, or critic. This critic is expensive and harder to tune. It also has its own problems: the policy keeps changing, so the value model is always trying to follow a moving target. That can introduce bias.
GRPO avoids that by using an up-to-date sampled baseline. It is closer to unbiased, but it tends to have higher variance.
For early LLM RL tasks, that tradeoff made sense:
• Rollouts were short
• Final rewards were clear
• Memory savings mattered a lot
• Multiple samples per prompt were manageable
• Math/code tasks were relatively easy to verify
That is why GRPO worked so well for many short, verifiable reasoning tasks.
But long-horizon agentic tasks change the game. 🎮
A long agent task can look much more like a game environment:
• Many steps
• Tool calls
• Partial progress
• Delayed failure
• Noisy observations
• Intermediate rewards
• Wrong action penalties
• Context compression
• Different paths to the same final answer
This is where GRPO starts to struggle.
The biggest issue is credit assignment. In GRPO, the final reward is applied broadly across the whole trajectory. If a task succeeds, many tokens get rewarded. If it fails, many tokens get punished.
But in a long task, that is too coarse.
Maybe the first half was bad, but the final recovery was good. Maybe one tool call at step 30 caused failure at step 100. Maybe two successful trajectories are not really comparable because one used 4K tokens and another used 200K tokens with heavy tool use and context compression.
GRPO sees the final outcome. It does not naturally know which step actually mattered.
That creates high variance.
In short tasks, group comparison works well. In long tasks, group sampling can collapse into two bad cases:
1. All samples fail
The whole expensive rollout gives almost no useful training signal.
2. Only one sample succeeds
That single success may be luck, but GRPO may treat it as a strong positive signal and over-reward the trajectory.
Both are dangerous for long agentic training.
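The two cases are easy to see numerically. This is a small sketch assuming a mean-only group baseline and binary success rewards; the group size of 8 is arbitrary and the helper name is hypothetical.

```python
def baseline_advantages(rewards):
    """Reward minus the group mean (mean-only baseline, no std scaling)."""
    base = sum(rewards) / len(rewards)
    return [r - base for r in rewards]

# Case 1: all 8 rollouts fail. Every advantage is 0.0, so there is
# nothing to learn from, however expensive the rollouts were.
all_fail = baseline_advantages([0.0] * 8)
print(all_fail)

# Case 2: exactly one rollout succeeds. It gets 0.875 and each failure
# gets -0.125, so the update is dominated by a single trajectory.
one_hit = baseline_advantages([1.0] + [0.0] * 7)
print(one_hit)
```

If the advantages are additionally divided by the group standard deviation, the lone success is weighted even more heavily relative to the failures.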
This is where PPO’s critic becomes valuable again. A value model can learn expected value under noisy states. It can provide denser feedback before the full rollout ends. It is more expensive, but it helps with long-horizon credit assignment.
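One common way a critic turns into per-step feedback is Generalized Advantage Estimation (GAE), which is often paired with PPO. The sketch below is illustrative only: the source does not say which estimator GLM-5.2 uses, and the reward and value numbers are made up.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finished trajectory.

    rewards[t] is the reward received after step t, values[t] is the
    critic's estimate at step t. The value after the last step is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD error: did this step turn out better than the critic expected?
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 4-step rollout with a single terminal reward. The critic's
# estimate drops at step 2, so that step is credited differently from
# the others instead of sharing one trajectory-wide score.
print(gae_advantages([0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.1, 0.6]))
```

With `lam=1` and a critic that always predicts 0, this reduces to the plain discounted return at every step, which is close to the outcome-only signal described above. Lower `lam` leans more on the critic, trading variance for potential bias.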
So the author’s view is: GRPO is not being rejected because it was wrong. It is being outgrown by the task format.
For short, deterministic, verifiable tasks, GRPO remains strong.
For long, noisy, tool-heavy agentic tasks, PPO-style value modeling may simply be the better fit.
The “compaction problem” mentioned around long contexts is likely more of a symptom. The deeper issue is that GRPO’s weaknesses become costly when trajectories are long and states keep changing.
Could GRPO still work? Yes, if paired with a strong Process Reward Model. The author points out that DeepSeek MathV2 uses this direction. Process-level signals can help fix GRPO’s sparse-reward weakness.
But without that, returning to PPO makes sense.
🎯The bigger takeaway:
GRPO saved the cost of a value model. PPO brings the value model back.
GRPO’s main advantage was efficiency. It removed the critic and saved resources. But for long-horizon agentic tasks, the critic’s ability to generalize and assign credit may be worth the cost again.
In the Agent era, RL for LLMs is becoming less like solving a short math problem and more like training an agent to play a long, noisy game.
And for that world, value models may still be the soul of RL.
🔗Full Reading (CN):
https://t.co/hf1GsDBc3e
Xiaomi EV Home Charging Robotic Arm, a seamless, fully automated home charging experience.
"Human x Car x Home" smart ecosystem. Remote control, right from your smartphone.
When looking at different companies using "AI Agents" - you have to ask yourself, what's their competitive advantage? If everyone is using the same agents, how are their products differentiated? The models are all trained on the same target distribution...