Google fired the guy who made the Google Workspace CLI, because he made the Google Workspace CLI.
Lucky me, Google can't fire me. https://t.co/o15a6lOxec
It is a shame this account has only 5k followers.
Zhihu is the pinnacle of Deep Learning blogging ❤️.
This is the Less Wrong of China, but far deeper and far more technical.
They make incredible diagrams, if you are a visual person.
Very thankful to you Zhihu team!!!!
Why Would GLM-5.2 Move Away From GRPO?
🌟Insights from Zhihu contributor 九老师
TL;DR: GLM-5.2 dropping GRPO does not mean GRPO is “bad.” It means the assumptions that made GRPO attractive for short LLM RL tasks may no longer hold for long-horizon agentic tasks. When rollouts get longer, environments get noisier, and credit assignment gets harder, PPO + value modeling starts looking useful again.
The key question is not simply “why did GLM-5.2 stop using GRPO?” A better question is: why did GRPO become useful for LLM RL in the first place?
If the reasons that made GRPO attractive no longer hold, then going back to PPO becomes natural.
GRPO can be understood as a sampled-baseline method. Instead of training a separate value model, it samples multiple responses for the same prompt and uses the group average as a baseline.
That is elegant. You get a relative reward signal without paying for a separate critic. In short tasks, this is very appealing.
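A minimal sketch of the sampled-baseline idea, assuming one scalar reward per sampled response. The function name `group_relative_advantages` and the `normalize` flag are illustrative, not taken from any particular library. Dividing by the group's standard deviation is a common choice in GRPO implementations, but treat it as an assumption here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage of each sampled response relative to its own group."""
    baseline = mean(rewards)  # the group average acts as the baseline
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # Assumption: scale by the group's standard deviation; eps avoids division by zero.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]

# Four sampled answers to the same prompt, scored 1 (correct) or 0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards, normalize=False))  # [0.5, -0.5, -0.5, 0.5]
print(group_relative_advantages(rewards))  # roughly [1, -1, -1, 1]
```

No critic is involved: the only extra cost is sampling several responses per prompt.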
But there is a tradeoff.⚖️
PPO uses a learned value function, or critic. This critic is expensive and harder to tune. It also has its own problems: the policy keeps changing, so the value model is always trying to follow a moving target. That can introduce bias.
GRPO avoids that by using an up-to-date sampled baseline. It is closer to unbiased, but it tends to have higher variance.
For early LLM RL tasks, that tradeoff made sense:
• Rollouts were short
• Final rewards were clear
• Memory savings mattered a lot
• Multiple samples per prompt were manageable
• Math/code tasks were relatively easy to verify
That is why GRPO worked so well for many short, verifiable reasoning tasks.
But long-horizon agentic tasks change the game. 🎮
A long agent task can look much more like a game environment:
• Many steps
• Tool calls
• Partial progress
• Delayed failure
• Noisy observations
• Intermediate rewards
• Wrong action penalties
• Context compression
• Different paths to the same final answer
This is where GRPO starts to struggle.
The biggest issue is credit assignment. In GRPO, the final reward is applied broadly across the whole trajectory. If a task succeeds, many tokens get rewarded. If it fails, many tokens get punished.
But in a long task, that is too coarse.
Maybe the first half was bad, but the final recovery was good. Maybe one tool call at step 30 caused failure at step 100. Maybe two successful trajectories are not really comparable because one used 4K tokens and another used 200K tokens with heavy tool use and context compression.
GRPO sees the final outcome. It does not naturally know which step actually mattered.
That creates high variance.
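To make the coarseness concrete, here is a toy sketch that assumes outcome-only training where every token inherits the advantage of its trajectory. The helper `token_level_advantages` and the 100-step rollout are hypothetical.

```python
def token_level_advantages(trajectory_advantage, num_tokens):
    """Outcome-only training: every token inherits the trajectory's advantage."""
    return [trajectory_advantage] * num_tokens

# Hypothetical failed rollout of 100 steps where only step 30 was the real mistake.
failed = token_level_advantages(-1.0, 100)

# Step 1, step 30, and step 100 all receive exactly the same signal.
print(failed[0] == failed[29] == failed[99])  # True
```

Nothing in this signal distinguishes the step that caused the failure from the steps that were fine.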
In short tasks, group comparison works well. In long tasks, group sampling can collapse into two bad cases:
1. All samples fail
The whole expensive rollout gives almost no useful training signal.
2. Only one sample succeeds
That single success may be luck, but GRPO may treat it as a strong positive signal and over-reward the trajectory.
Both are dangerous for long agentic training.
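Both cases can be reproduced with a few lines, assuming binary rewards, a group of 8 rollouts, and mean/std normalization (the normalization and the group size are assumptions for illustration).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Group-mean baseline with std normalization (an assumed, common variant)."""
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]

all_fail = [0.0] * 8               # case 1: every rollout fails
one_success = [1.0] + [0.0] * 7    # case 2: a single rollout succeeds

all_fail_adv = group_relative_advantages(all_fail)
one_success_adv = group_relative_advantages(one_success)

print(all_fail_adv)        # all zeros: no gradient signal from 8 rollouts
print(one_success_adv[0])  # about 2.65 for the lone success
print(one_success_adv[1])  # about -0.38 for each failure
```

In the first case the group carries no learning signal at all. In the second, the lone success gets an advantage roughly seven times larger in magnitude than any failure, whether or not it was luck.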
This is where PPO’s critic becomes valuable again. A value model can learn expected value under noisy states. It can provide denser feedback before the full rollout ends. It is more expensive, but it helps with long-horizon credit assignment.
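One way to see the denser feedback is Generalized Advantage Estimation (GAE), the estimator commonly paired with PPO. The source does not say which estimator GLM-5.2 uses, so this is only a sketch; the rewards and critic values below are made up.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Per-step advantages from rewards and critic value estimates.

    `values` has len(rewards) + 1 entries; the last one is the value
    after the final step (0.0 for a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better or worse step t turned out than the critic expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-step rollout that succeeds only at the end.
rewards = [0.0, 0.0, 1.0]
# Hypothetical critic estimates: the value drops after step 0, then recovers.
values = [0.5, 0.2, 0.6, 0.0]

per_step = gae_advantages(rewards, values, gamma=1.0, lam=0.0)
print(per_step)  # approximately [-0.3, 0.4, 0.4]
```

Even though the rollout succeeded, step 0 gets a negative advantage because the critic's estimate dropped right after it, while the later recovery steps are rewarded. With `lam=1.0` the estimate leans on the full observed return instead, giving approximately `[0.5, 0.8, 0.4]`.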
So the author’s view is: GRPO is not being rejected because it was wrong. It is being outgrown by the task format.
For short, deterministic, verifiable tasks, GRPO remains strong.
For long, noisy, tool-heavy agentic tasks, PPO-style value modeling may simply be the better fit.
The “compaction problem” mentioned around long contexts is likely more of a symptom. The deeper issue is that GRPO’s weaknesses become costly when trajectories are long and states keep changing.
Could GRPO still work? Yes, if paired with a strong Process Reward Model. The author points out that DeepSeekMath-V2 takes this direction. Process-level signals can help fix GRPO’s sparse-reward weakness.
But without that, returning to PPO makes sense.
🎯The bigger takeaway:
GRPO dropped the value model. PPO brings it back.
GRPO’s main advantage was efficiency. It removed the critic and saved resources. But for long-horizon agentic tasks, the critic’s ability to generalize and assign credit may be worth the cost again.
In the Agent era, RL for LLMs is becoming less like solving a short math problem and more like training an agent to play a long, noisy game.
And for that world, value models may still be the soul of RL.
🔗Full Reading (CN):
https://t.co/hf1GsDBc3e
after close to four years at @openai, i moved from the bay area to india earlier this year. i still believe deeply in ensuring true superintelligence accelerates science and remains accessible and beneficial to all. having grown up here, i've also always felt deeply connected to the ecosystem here.
over the past several weeks, i've been speaking with researchers, engineers, and thinkers across india and apac. it's become clear that there are many who want to build the future from here. moving back felt like the counterintuitive choice. i no longer think that's true.
what's been missing is the belief that you can build institutions of global consequence from anywhere. and more importantly, the ambition and the will to pursue ideas that seem impossibly large at first. this may be a once in a generation opportunity.
more to come soon. DMs open if this resonates.
I’m excited to share that I’ll be joining OpenAI and look forward to working with the exceptional team there.
It was a difficult decision to move on. I’m incredibly proud of the amazing team at Google and everything we’ve built together. It has been an honor and a pleasure to work with all of you.
India’s IT ministry banned Telegram for one week because some users shared leaked exam questions.
This punishes 150M+ ordinary Telegram users in India — not the insiders who leaked the exam materials.
And the ban hasn't stopped anything. The leaks just moved to other apps.
Indian telecom Reliance is sabotaging access to Telegram for millions of users OUTSIDE India (including the UAE) via a rogue method called BGP hijacking.
The sabotage seems intentional, as Reliance has ignored multiple reports.
This may be part of a competitive war, as Reliance is partially owned by Meta — the company behind WhatsApp.
Network operators are advised to reject unauthorized BGP announcements from Reliance (AS18101) to prevent route hijacks and ensure stable Internet access for their users.
Such abuse of global Internet routing is alarming. I wouldn’t be surprised if Reliance/WhatsApp were also behind the recent lobbying effort to ban Telegram in India.
🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!
🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.
⚡️ 6x High-Speed Mode coming soon!
🔌 Available today via Kimi API and Kimi Code.
🔗 Kimi Code: https://t.co/uvoSJKyGCY
🔗 API: https://t.co/EOZkbOwCN4