building_Finpeel

Google fired the guy that made the google workspace cli, because he made the google workspace cli. Lucky me, Google can't fire me. https://t.co/o15a6lOxec

164

497

MeetRahulViews retweeted

GDP

@bookwormengr

3 days ago

It is a shame this account has only 5k followers. Zhihu is the pinnacle of Deep Learning blogging ❤️. This is the Less Wrong of China, but far deeper, far more technical. They make incredible diagrams, if you are a visual person. Very thankful to you Zhihu team!!!!

776

926

206K

MeetRahulViews retweeted

Zhihu Frontier

@ZhihuFrontier

3 days ago

Why Would GLM-5.2 Move Away From GRPO?
🌟 Insights from Zhihu contributor 九老师

TL;DR: GLM-5.2 dropping GRPO does not mean GRPO is “bad.” It means the assumptions that made GRPO attractive for short LLM RL tasks may no longer hold for long-horizon agentic tasks. When rollouts get longer, environments get noisier, and credit assignment gets harder, PPO + value modeling starts looking useful again.

The key question is not simply “why did GLM-5.2 stop using GRPO?” A better question is: why did GRPO become useful for LLM RL in the first place?
If the reasons that made GRPO attractive no longer hold, then going back to PPO becomes natural.
GRPO can be understood as a sampled-baseline method. Instead of training a separate value model, it samples multiple responses for the same prompt and uses the group average as a baseline.
That is elegant. You get a relative reward signal without paying for a separate critic. In short tasks, this is very appealing.

But there is a tradeoff. ⚖️
PPO uses a learned value function, or critic. This critic is expensive and harder to tune. It also has its own problems: the policy keeps changing, so the value model is always trying to follow a moving target. That can introduce bias.
GRPO avoids that by using an up-to-date sampled baseline. It is closer to low-bias, but it tends to have higher variance.
For early LLM RL tasks, that tradeoff made sense:
• Rollouts were short
• Final rewards were clear
• Memory savings mattered a lot
• Multiple samples per prompt were manageable
• Math/code tasks were relatively easy to verify
That is why GRPO worked so well for many short, verifiable reasoning tasks.

But long-horizon agentic tasks change the game. 🎮
A long agent task can look much more like a game environment:
• Many steps
• Tool calls
• Partial progress
• Delayed failure
• Noisy observations
• Intermediate rewards
• Wrong action penalties
• Context compression
• Different paths to the same final answer
This is where GRPO starts to struggle.
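The sampled-baseline idea described above can be sketched in a few lines. This is a minimal illustration, not the GLM or DeepSeek training code: the function name `group_relative_advantages`, the `eps` constant, and the example rewards are all assumptions made for the sketch. It uses the commonly described form in which each response's reward is centered on the group mean and scaled by the group standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantages for one prompt's group of sampled responses.

    The group mean plays the role of the baseline, so no separate
    value model (critic) is needed. `eps` is a hypothetical guard
    against division by zero when all rewards are identical.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


# Four sampled answers to the same prompt, scored by a verifier (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # roughly [1, -1, -1, 1]
```

Every response is scored only relative to its siblings, which is why the method needs several samples per prompt and a clear final reward.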
The biggest issue is credit assignment. In GRPO, the final reward is applied broadly across the whole trajectory. If a task succeeds, many tokens get rewarded. If it fails, many tokens get punished.
But in a long task, that is too coarse.

Maybe the first half was bad, but the final recovery was good. Maybe one tool call at step 30 caused failure at step 100. Maybe two successful trajectories are not really comparable because one used 4K tokens and another used 200K tokens with heavy tool use and context compression.

GRPO sees the final outcome. It does not naturally know which step actually mattered.
That creates high variance.
In short tasks, group comparison works well. In long tasks, group sampling can collapse into two bad cases:
1. All samples fail
The whole expensive rollout gives almost no useful training signal.
2. Only one sample succeeds
That single success may be luck, but GRPO may treat it as a strong positive signal and over-reward the trajectory.
Both are dangerous for long agentic training.
This is where PPO’s critic becomes valuable again. A value model can learn expected value under noisy states. It can provide denser feedback before the full rollout ends. It is more expensive, but it helps with long-horizon credit assignment.
So the author’s view is: GRPO is not being rejected because it was wrong. It is being outgrown by the task format.

For short, deterministic, verifiable tasks, GRPO remains strong.
For long, noisy, tool-heavy agentic tasks, PPO-style value modeling may simply be the better fit.

The “compaction problem” mentioned around long contexts is likely more of a symptom. The deeper issue is that GRPO’s weaknesses become costly when trajectories are long and states keep changing.
Could GRPO still work? Yes, if paired with a strong Process Reward Model. The author points out that DeepSeek MathV2 uses this direction. Process-level signals can help fix GRPO’s sparse-reward weakness.

But without that, returning to PPO makes sense.
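The two collapse cases and the critic's denser feedback can be made concrete with a small sketch. Everything here is illustrative: the helper names, the group size of 8, and the critic values are made-up assumptions, and Generalized Advantage Estimation (GAE) is used as one common way a PPO-style critic turns value estimates into per-step advantages, not necessarily what GLM-5.2 does.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Outcome-only, group-relative advantages (one number per trajectory)."""
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


def gae(step_rewards, values, gamma=1.0, lam=0.95):
    """Per-step advantages from critic values via GAE.

    `values` holds one estimate per state, so it has one more entry
    than `step_rewards`; the last entry is 0.0 for a terminal state.
    """
    advantages = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        # TD residual: how much this step changed the critic's expectation.
        delta = step_rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Case 1: all samples fail, so every advantage is zero (no training signal).
all_fail = group_advantages([0.0] * 8)

# Case 2: one lucky success gets a large positive advantage, and that
# single number would be applied to every token of its trajectory.
one_success = group_advantages([1.0] + [0.0] * 7)

# A failed 5-step trajectory with hypothetical critic values: the estimate
# drops from 0.6 to 0.1 after the action at step 1 (say, a bad tool call).
step_rewards = [0.0, 0.0, 0.0, 0.0, 0.0]
values = [0.6, 0.6, 0.1, 0.1, 0.1, 0.0]
per_step = gae(step_rewards, values)
worst_step = per_step.index(min(per_step))
print(all_fail, one_success[0], worst_step)
```

With these numbers the most negative advantage lands on step 1, while the later steps receive much smaller penalties. The outcome-only signal cannot make that distinction, since it assigns the same value to the whole trajectory. How well this works in practice depends on how accurate the critic is, which is the cost the thread describes.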
🎯 The bigger takeaway:
GRPO dropped the value model. PPO brings it back.
GRPO’s main advantage was efficiency. It removed the critic and saved resources. But for long-horizon agentic tasks, the critic’s ability to generalize and assign credit may be worth the cost again.
In the Agent era, RL for LLMs is becoming less like solving a short math problem and more like training an agent to play a long, noisy game.
And for that world, value models may still be the soul of RL.

🔗 Full Reading (CN): https://t.co/hf1GsDBc3e

798

106

262K

MeetRahulViews retweeted

Philip Kiely

@philipkiely

5 days ago

https://t.co/LlNSUlxWLa

135

528K

MeetRahulViews retweeted

ℏεsam

@Hesamation

6 days ago

CEO of Vercel. I don't remember the last time the community has unanimously praised a new model release. They really cooked.

159

12K

MeetRahulViews retweeted

shyamal

@shyamalanadkat

6 days ago

after close to four years at @openai, i moved from the bay area to india earlier this year. i still believe deeply in ensuring true superintelligence accelerates science and remains accessible and beneficial to all. having grown up here, i've also always felt deeply connected to the ecosystem here. over the past several weeks, i've been speaking with researchers, engineers, and thinkers across india and apac. it's become clear that there are many who want to build the future from here. moving back felt like the counterintuitive choice. i no longer think that's true. what's been missing is the belief that you can build institutions of global consequence from anywhere. and more importantly, the ambition and the will to pursue ideas that seem impossibly large at first. this may be a once in a generation opportunity. more to come soon. DMs open if this resonates.

354

406

601K

MeetRahulViews retweeted

Greg Brockman

@gdb

9 days ago

Rust is great. We’re making a $600,000 commitment to the Rust Foundation:

125

177

298

425K

building_Finpeel @MeetRahulViews

7 days ago

launching https://t.co/FVotENEHtQ

building_Finpeel @MeetRahulViews

7 days ago

Launching https://t.co/ywf8kSFnUT

building_Finpeel @MeetRahulViews

7 days ago

Going live with https://t.co/QIF89RC1d9

MeetRahulViews retweeted

Noam Shazeer

@NoamShazeer

10 days ago

I’m excited to share that I’ll be joining OpenAI and look forward to working with the exceptional team there. It was a difficult decision to move on. I’m incredibly proud of the amazing team at Google and everything we’ve built together. It has been an honor and a pleasure to work with all of you.

985

16K

869

MeetRahulViews retweeted

Pavel Durov

@durov

11 days ago

India’s IT ministry banned Telegram for one week because some users shared leaked exam questions. This punishes 150M+ ordinary Telegram users in India — not the insiders who leaked the exam materials. And the ban hasn't stopped anything. The leaks just moved to other apps.

56K

12K

MeetRahulViews retweeted

Pavel Durov

@durov

11 days ago

Indian telecom Reliance is sabotaging access to Telegram for millions of users OUTSIDE India (including the UAE) via a rogue method called BGP hijacking. The sabotage seems intentional, as Reliance has ignored multiple reports. This may be part of a competitive war, as Reliance is partially owned by Meta — the company behind WhatsApp. Network operators are advised to reject unauthorized BGP announcements from Reliance (AS18101) to prevent route hijacks and ensure stable Internet access for their users. Such abuse of global Internet routing is alarming. I wouldn’t be surprised if Reliance/WhatsApp were also behind the recent lobbying effort to ban Telegram in India.

51K

MeetRahulViews retweeted

Michael Truell

@mntruell

11 days ago

Lots to do together. Excited to be joining forces with @SpaceX to build useful AI.

730

13K

483

MeetRahulViews retweeted

Trade Whisperer

@TradexWhisperer

12 days ago

$MU New All Time High. Up ~12% today. $62 to $1,096. Don't ever doubt my conviction. $1,500 incoming.

592

61K

MeetRahulViews retweeted

David Ondrej

@DavidOndrej1

12 days ago

the most entertaining outcome is the most likely

152

MeetRahulViews retweeted

Harrison Kinsley

@Sentdex

13 days ago

While closed source AI is in shambles, open source is having one of the best weeks of all time: Z ai GLM 5.2, Minimax M3, Kimi 2.7 code.

138

382

125K

MeetRahulViews retweeted

Kimi.ai @Kimi_Moonshot

15 days ago

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!

🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.

⚡️ 6x High-Speed Mode coming soon!
🔌 Available today via Kimi API and Kimi Code.

🔗 Kimi Code: https://t.co/uvoSJKyGCY
🔗 API: https://t.co/EOZkbOwCN4

644

14K
