MIT just released a new RL method called Pedagogical RL.
The main lesson -> correct reasoning traces can still be bad training data.
The idea is similar to teaching someone backprop.
Say you have a tiny computation graph:
z = wx + b
a = ReLU(z)
L = (a - y)²
If you already understand backprop, you can jump straight to the gradient:
dL/dw = 2(a - y) · 1[z > 0] · x
The answer is correct but it skips the reasoning process.
To get there, you need to break the computation into local pieces:
dL/da = 2(a - y)
da/dz = 1[z > 0]
dz/dw = x
Then backprop is just composing those local derivatives backward through the graph:
dL/dw = dL/da · da/dz · dz/dw = 2(a - y) · 1[z > 0] · x
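Here is a minimal sketch of that composition in plain Python, checked against a finite-difference estimate. The function names (forward, grad_w, numeric_grad_w) and the sample values are illustrative assumptions, not anything from the blog.

```python
def forward(w, x, b, y):
    """Run the tiny graph: z = wx + b, a = ReLU(z), L = (a - y)^2."""
    z = w * x + b
    a = max(z, 0.0)
    loss = (a - y) ** 2
    return z, a, loss


def grad_w(w, x, b, y):
    """Compose the three local derivatives backward through the graph."""
    z, a, _ = forward(w, x, b, y)
    dL_da = 2.0 * (a - y)            # derivative of the squared error
    da_dz = 1.0 if z > 0 else 0.0    # ReLU passes gradient only when z > 0
    dz_dw = x                        # derivative of the linear node
    return dL_da * da_dz * dz_dw


def numeric_grad_w(w, x, b, y, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    _, _, loss_plus = forward(w + eps, x, b, y)
    _, _, loss_minus = forward(w - eps, x, b, y)
    return (loss_plus - loss_minus) / (2 * eps)


if __name__ == "__main__":
    w, x, b, y = 0.5, 2.0, 0.1, 3.0   # arbitrary sample values
    print("analytic:", grad_w(w, x, b, y))
    print("numeric: ", numeric_grad_w(w, x, b, y))
```

With these sample values z = 1.1, so the ReLU is active and both estimates should agree closely. With w = -1.0 the ReLU is inactive and the gradient is zero.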
Showing a student the final gradient does not teach them how to find gradients on new graphs.
Even telling them “just use the chain rule” may be too large a jump if they do not understand how to decompose the computation into intermediate nodes and local derivatives.
Reasoning RL has the same failure mode.
A rollout can pass the verifier while containing one step the student model would almost never have taken.
The trajectory gets the answer right, but the learning signal is brittle because the path is too far from the student’s current policy.
Pedagogical RL trains a privileged teacher that knows the answer, then rewards it for producing trajectories that stay learnable for the student.
The trick is to use a spike-aware reward. It penalizes single huge surprise gaps in the trajectory, even when the average likelihood of the trajectory looks fine.
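To make "spike-aware" concrete, here is a rough sketch, not the method's actual reward. It assumes we have the student's per-token log-probabilities for a teacher trajectory. The function name, the threshold, and the weights are all hypothetical choices for illustration.

```python
def spike_aware_reward(student_logprobs, correct,
                       spike_threshold=4.0, mean_weight=0.1, spike_weight=1.0):
    """Hypothetical reward: correctness minus average and worst-case surprisal.

    student_logprobs: the student's log-probability of each teacher token.
    All numeric defaults are illustrative assumptions, not published values.
    """
    if not student_logprobs:
        raise ValueError("trajectory must contain at least one token")
    surprisals = [-lp for lp in student_logprobs]
    mean_surprisal = sum(surprisals) / len(surprisals)
    # Only the part of the single largest surprisal above the threshold counts.
    spike = max(0.0, max(surprisals) - spike_threshold)
    base = 1.0 if correct else 0.0
    return base - mean_weight * mean_surprisal - spike_weight * spike


if __name__ == "__main__":
    smooth = [-2.0, -2.0, -2.0, -2.0]   # mean surprisal 2.0, no spike
    spiky = [-0.5, -0.5, -0.5, -6.5]    # same mean 2.0, one huge step
    print("smooth:", spike_aware_reward(smooth, correct=True))
    print("spiky: ", spike_aware_reward(spiky, correct=True))
```

Both toy trajectories are correct and have the same average surprisal, but the one with a single very unlikely step scores lower, which is the behavior the post describes.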
Then the student learns with surprisal-gated imitation, where teacher tokens that are still too surprising get downweighted.
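One way to picture surprisal gating is sketched below, under my own assumptions: tokens below a surprisal gate keep full weight, and tokens above it decay exponentially. The gate value, the decay shape, and the function names are hypothetical, and the real method may gate differently.

```python
import math


def gate_weights(student_logprobs, gate=3.0, sharpness=1.0):
    """Hypothetical per-token weights: 1.0 below the gate, decaying above it."""
    weights = []
    for lp in student_logprobs:
        surprisal = -lp
        excess = max(0.0, surprisal - gate)
        weights.append(math.exp(-sharpness * excess))
    return weights


def gated_imitation_loss(student_logprobs, gate=3.0, sharpness=1.0):
    """Weighted negative log-likelihood of the teacher tokens.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them); that is an assumption of this sketch.
    """
    weights = gate_weights(student_logprobs, gate, sharpness)
    total = sum(w * -lp for w, lp in zip(weights, student_logprobs))
    return total / len(student_logprobs)


if __name__ == "__main__":
    logprobs = [-1.0, -5.0]   # one easy token, one still-surprising token
    print("weights:", gate_weights(logprobs))
    print("loss:   ", gated_imitation_loss(logprobs))
```

The easy token contributes at full strength, while the token the student still finds very surprising is downweighted instead of dominating the update.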
The teacher is learning how to teach at the student’s current level.
Pedagogical RL makes RL more efficient by selecting the trajectories the student is most ready to learn from.
Less waiting for the model to get lucky rollouts. More training signal from examples that meet the student where it is.
Full blog in comments
🗓️ Today marks the 80th day of #Iran's internet blackout, with the shutdown passing 1896 hours. Meanwhile, pro-regime content floods social media, as Iranians seeking to get pro/whitelist access say they are being asked to meet a quota of daily propaganda posts, policed by AI.
🌍 #Iran's digital isolation is now entering its 77th day as the internet blackout passes 1824 hours. The measure presents an emerging mental health risk to the public, who are largely cut off from online platforms, communications, and normal interaction with the outside world.
💸 #Iran's internet blackout is now past hour 1776, entering its 75th day. The digital censorship measure has led to profiteering and a decline in digital security as government-backed "pro" internet schemes and selective whitelisting result in surveillance, corruption and scams.
This is a great report that provides a thoughtful, detailed and very well researched description of the risks of AI. It is essential reading for anyone who wants to write or talk about AI risks.
Today is International Mother’s Day.
This is a photo of my mother at the grave of her only son, whom the Islamic Regime of Iran shot dead only 4 months ago, on January 8!
How can I wish a happy Mother’s Day to a woman who recently lost her son to a terrorist act of the Islamic Regime?
Don’t stop talking about Iran!
#IranMassacre
#سینا_کاظمی
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
70 days with NO INTERNET in IRAN!!!!
WTF???? FREE IRAN ♥️💚🤍
SECRETARY RUBIO: The people of Iran are daily victims of the regime and the President has deep sympathy for what they're going through. This is a country that shouldn’t be isolated, as the people are phenomenal.
.@SecRubio on the Iranian people: "I don't know of any country in the world where there's a bigger difference between the people and the people who run the country."
To our Arab neighbors: our region is at a fork in the road.
One path leaves the Islamic Republic in power and leads to further crimes against our people and yours.
The other helps Iranians reclaim Iran from this illegitimate regime and returns peace and stability to the region.
⌛️ Metrics show the #Iran internet blackout has entered its 65th day after 1536 hours, amid growing concern over the human rights situation in the country.
While whitelisting and privileged access are in place for a select few, the general public remains cut off from the outside world.
The suffering inflicted by the Islamic regime is a wound that deepens daily. What makes this tragedy unbearable is the hypocritical approach of so-called free democratic nations that continue to support the dictatorship through their policies.
#DigitalBlackOutIran
👨‍👩‍👧‍👦 The internet blackout in #Iran is now in its 63rd day after 1488 hours.
The extended censorship measure presents a barrier for friends and families abroad checking in on loved ones' safety and wellbeing.
A German reporter asked the crown prince if he is an Israeli Asset.
Reza Pahlavi gently reminded the useless reporter that Iran saved Jews while the Germans were burning Jews in ovens.
"Millions of Iranians have been silenced. And [...] this silencing is done not only by the Iranian regime, but also by the international, and in particular European, media"
Our own @NOS uncritically aired a clip of Iranians knitting and playing volleyball 🚮