Artificially Intelligent

@ArtiIntelligent

Insanity is doing the same thing over and over and expecting different results...

the Milky Way

Joined February 2025

6.9K Following

314 Followers

1.3K Posts

Artificially Intelligent

@ArtiIntelligent

about 1 hour ago

@Srasgon technically yes, the purchasing power has changed! What you could previously purchase for $50 now costs ~$75 ;)

ArtiIntelligent retweeted

emozilla

@theemozilla

1 day ago

I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced.

I fear something terrible has happened. https://t.co/ARIqz4EiGU

103

382

ArtiIntelligent retweeted

Lilian Weng

@lilianweng

2 days ago

A super long overdue (3+ years?) post on scaling laws. Compute is expensive. Scaling laws are a way to help us reason about the optimal compute allocation between data and model size before committing to a large run. The post covers what scaling laws predict, how compute-optimal allocation works, why Kaplan et al. and Chinchilla disagree, and how data limits + fitting details make extrapolation tricky. https://t.co/HP26eJvjHB

562

389K

ArtiIntelligent retweeted

@MissMi1973

1 day ago

Anthropic’s fear campaign around Mythos has almost single-handedly slowed the normal release of GPT-5.6, while also making government approval of frontier model access the new normal for US AI labs.

It’s not hard to foresee that this will inevitably lead to:

1. Frontier models will release slower. The days when the industry was shipping new models every month are over.
2. Frontier labs will be compelled to build “will the government permit release” into their training process as a binding constraint.
3. A caste-like pattern of access will take hold across the entire industry.

This is precisely why fear-based marketing and geopolitical posturing in the tech sector has always been a dangerous game to play.

672

117

142K

ArtiIntelligent retweeted

Vik Paruchuri

@VikParuchuri

1 day ago

In their OCR 4 launch this week, Mistral shared a significantly lower score for Chandra 2 than you get from our repo or by running our public code. They also omitted Infinity Parser, which reports 87.6%, from their olmocr comparison.

280

121

46K

ArtiIntelligent retweeted

Lars

@larsmoravy

2 days ago

If you need more reasons to tell your friends why to buy a Tesla, JD Power has a few. Tesla was 'unofficially' ranked 3rd in IQS (initial quality) and 1st in EVX (EV experience) https://t.co/H4s5ewHepf

106

264

210K

ArtiIntelligent retweeted

Mark Gurman

@markgurman

2 days ago

Apple’s plans are fluid given the component supply chain right now, but it aims to launch the M6 this year, the M7 by the middle of next year, the M7 Pro and M7 Max in late 2027 and the M7 Ultra in 2028.

597

96K

ArtiIntelligent retweeted

Hikari∣LocalLLM⚡

@Hikari_07_jp

3 days ago

I got DeepSeek-V4-Flash MTP speculative decoding actually working on 2× RTX PRO 6000: +38% single-stream throughput. It was declared “broken on SM120.” The kernels weren’t the problem. It was one mis-routed quantization format in the loader. ←on 45tok/s off 98tok/s→

138

10K

ArtiIntelligent retweeted

Zhihu Frontier

@ZhihuFrontier

4 days ago

Why Would GLM-5.2 Move Away From GRPO?
🌟Insights from Zhihu contributor 九老师

TL;DR: GLM-5.2 dropping GRPO does not mean GRPO is “bad.” It means the assumptions that made GRPO attractive for short LLM RL tasks may no longer hold for long-horizon agentic tasks. When rollouts get longer, environments get noisier, and credit assignment gets harder, PPO + value modeling starts looking useful again.

The key question is not simply “why did GLM-5.2 stop using GRPO?” A better question is: why did GRPO become useful for LLM RL in the first place?
If the reasons that made GRPO attractive no longer hold, then going back to PPO becomes natural.
GRPO can be understood as a sampled-baseline method. Instead of training a separate value model, it samples multiple responses for the same prompt and uses the group average as a baseline.
That is elegant. You get a relative reward signal without paying for a separate critic. In short tasks, this is very appealing.
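A minimal sketch of the sampled-baseline idea described above, assuming scalar rewards per response. The function name `group_relative_advantages` is hypothetical, and dividing by the group standard deviation is an assumption that goes beyond the thread, which only mentions the group average as the baseline.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Score each sampled response relative to the other samples for the same prompt."""
    baseline = mean(rewards)                      # the group average stands in for a critic
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps                 # assumption: rescale by the group std
    return [c / scale for c in centered]

# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct responses come out positive, wrong ones negative
```

No value model appears anywhere in the computation, which is where the memory saving comes from. The price is that each prompt needs several complete rollouts before any advantage can be computed.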

But there is a tradeoff.⚖️
PPO uses a learned value function, or critic. This critic is expensive and harder to tune. It also has its own problems: the policy keeps changing, so the value model is always trying to follow a moving target. That can introduce bias.
GRPO avoids that by using a baseline sampled from the current policy. That baseline is closer to unbiased, but it tends to have higher variance.
For early LLM RL tasks, that tradeoff made sense:
• Rollouts were short
• Final rewards were clear
• Memory savings mattered a lot
• Multiple samples per prompt were manageable
• Math/code tasks were relatively easy to verify
That is why GRPO worked so well for many short, verifiable reasoning tasks.

But long-horizon agentic tasks change the game. 🎮
A long agent task can look much more like a game environment:
• Many steps
• Tool calls
• Partial progress
• Delayed failure
• Noisy observations
• Intermediate rewards
• Wrong action penalties
• Context compression
• Different paths to the same final answer
This is where GRPO starts to struggle.

The biggest issue is credit assignment. In GRPO, the final reward is applied broadly across the whole trajectory. If a task succeeds, many tokens get rewarded. If it fails, many tokens get punished.
But in a long task, that is too coarse.

Maybe the first half was bad, but the final recovery was good. Maybe one tool call at step 30 caused failure at step 100. Maybe two successful trajectories are not really comparable because one used 4K tokens and another used 200K tokens with heavy tool use and context compression.

GRPO sees the final outcome. It does not naturally know which step actually mattered.
That creates high variance.
In short tasks, group comparison works well. In long tasks, group sampling can collapse into two bad cases:
1. All samples fail. The whole expensive rollout gives almost no useful training signal.
2. Only one sample succeeds. That single success may be luck, but GRPO may treat it as a strong positive signal and over-reward the trajectory.
Both are dangerous for long agentic training.
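The two cases can be checked with a small sketch, assuming binary outcome rewards, a plain group-mean baseline without std normalization, and a group of eight rollouts of five tokens each. The helper `token_advantages` and all of the numbers are hypothetical, chosen only to keep the arithmetic visible.

```python
from statistics import mean

def token_advantages(rewards, lengths):
    """Copy each trajectory's outcome-level advantage onto every one of its tokens."""
    baseline = mean(rewards)
    return [[r - baseline] * n for r, n in zip(rewards, lengths)]

# Case 1: all eight rollouts fail, so every token advantage is zero.
all_fail = token_advantages([0.0] * 8, lengths=[5] * 8)

# Case 2: one rollout succeeds, so all of its tokens share the same positive credit.
one_hit = token_advantages([1.0] + [0.0] * 7, lengths=[5] * 8)

print(all_fail[0])  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(one_hit[0])   # [0.875, 0.875, 0.875, 0.875, 0.875]
print(one_hit[1])   # [-0.125, -0.125, -0.125, -0.125, -0.125]
```

In the first case the group contributes nothing to the update. In the second, every token of the single success is pushed up equally, whether or not that token had anything to do with the success.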
This is where PPO’s critic becomes valuable again. A value model can learn expected value under noisy states. It can provide denser feedback before the full rollout ends. It is more expensive, but it helps with long-horizon credit assignment.
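As a rough illustration of denser feedback, here is a sketch of Generalized Advantage Estimation (GAE), the estimator commonly paired with PPO. The thread does not say which estimator GLM-5.2 uses, so treating it as GAE is an assumption, and the rollout and critic values below are made up.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Per-step advantages for one trajectory from critic value estimates.

    rewards[t] is the reward received after step t. values has one extra
    entry at the end: the bootstrap value, 0.0 for a terminal state.
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # discounted sum of TD errors
        advantages[t] = running
    return advantages

# Hypothetical 4-step rollout that ends in failure (no reward anywhere).
# The critic's estimate drops after step 1, for example after a bad tool call.
rewards = [0.0, 0.0, 0.0, 0.0]
values = [0.5, 0.5, 0.25, 0.25, 0.0]

print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))  # [0.0, -0.25, 0.0, -0.25]
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # [-0.5, -0.5, -0.25, -0.25]
```

With `lam=0.0` the blame lands on the steps where the critic's estimate falls, instead of being spread evenly over the whole trajectory. With `lam=1.0` the estimate falls back to the final outcome minus the critic's prediction at each step. Both depend on the critic being reasonably accurate, which is the tuning cost the thread mentions.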
So the author’s view is: GRPO is not being rejected because it was wrong. It is being outgrown by the task format.

For short, deterministic, verifiable tasks, GRPO remains strong.
For long, noisy, tool-heavy agentic tasks, PPO-style value modeling may simply be the better fit.

The “compaction problem” mentioned around long contexts is likely more of a symptom. The deeper issue is that GRPO’s weaknesses become costly when trajectories are long and states keep changing.
Could GRPO still work? Yes, if paired with a strong Process Reward Model. The author points out that DeepSeek MathV2 uses this direction. Process-level signals can help fix GRPO’s sparse-reward weakness.

But without that, returning to PPO makes sense.
🎯The bigger takeaway:
GRPO saved the cost of the value model. PPO brings it back.
GRPO’s main advantage was efficiency. It removed the critic and saved resources. But for long-horizon agentic tasks, the critic’s ability to generalize and assign credit may be worth the cost again.
In the Agent era, RL for LLMs is becoming less like solving a short math problem and more like training an agent to play a long, noisy game.
And for that world, value models may still be the soul of RL.

🔗Full Reading (CN):
https://t.co/hf1GsDBc3e

798

106

263K

ArtiIntelligent retweeted

Jetha Chan

@jetha

5 days ago

https://t.co/koqiQpaEx1

165

856K

Artificially Intelligent

@ArtiIntelligent

4 days ago

These guys implemented "Virtual-head padding" to make TP==3, tensor-parallel for three nodes!!! omg...

Tech2Wild

@Tech2Wild

7 days ago

3 x DGX Spark Owners Enjoy!
MiMo V2.5 Omni on 3x DGX Spark, no switch:
🧠 1M context
👁️ Full omni: text, image, video, audio
⚡ TP=3 + MTP, ~39 tok/s
🏆 97.3 quality (our #1 model)
Full recipe + every benchmark, reproducible: https://t.co/lpYztfumW5

162

Artificially Intelligent

@ArtiIntelligent

4 days ago

@BryanMcNamaraUS thank you for sharing!

228

ArtiIntelligent retweeted

Siddhartha Saxena

@siddsax

5 days ago

BTS of the day Sam Altman has been waiting his whole life

612

173

92K

Artificially Intelligent

@ArtiIntelligent

5 days ago

RT @juliarturc: This is what happens when you plug LLMs into voice assistants, instead of a decade of handwritten rules. This video dissec…

ArtiIntelligent retweeted

Xiaomi

@Xiaomi

5 days ago

Xiaomi EV Home Charging Robotic Arm, a seamless, fully automated home charging experience. "Human x Car x Home" smart ecosystem. Remote control, right from your smartphone.

386

112K

Artificially Intelligent

@ArtiIntelligent

6 days ago

@StockSavvyShay you do know that he passed away...

160

Artificially Intelligent

@ArtiIntelligent

7 days ago

@ClankerQueen @nvidia wow, how are you doing the fine-tune? do you have any code to share? results? thanks!

238

ArtiIntelligent retweeted

sarah guo

@saranormous

8 days ago

$INTC CEO @LipBuTan1 in new interview on his Foundry bet, and why we must make semis in the United States.

427

207

99K

Artificially Intelligent

@ArtiIntelligent

7 days ago

When looking at different companies using "AI Agents" - you have to ask yourself, what's their competitive advantage? If everyone is using the same agents, how are their products differentiated? The models are all trained on the same target distribution...

Artificially Intelligent

@ArtiIntelligent

8 days ago

@onusoz are you using the assistant model?
