ICML 2026: Latent Reasoning in TRMs is Secretly a Policy Improvement Operator
Why does recursive reasoning, especially latent reasoning, actually work? The theory is still young, and even mechanistic explanations are limited.
We close part of this gap by showing that latent reasoning is secretly doing policy improvement. Each recursion pushes the model steadily toward the target.
Based on this view, we propose an algorithm that boosts learning and inference efficiency by up to 18x.
Direct Preference Optimization (DPO) is an AI alignment method that provides a simple alternative to Reinforcement Learning from Human Feedback (RLHF). Instead of first training a reward model and then optimizing a policy with reinforcement learning, DPO learns directly from pairs of human preferences, for example a preferred response and a rejected one. This eliminates the need for a separate reward model while retaining the ability to align models with human judgments.
Mathematically, DPO can be viewed as optimizing a binary classification objective derived from a KL-constrained reinforcement learning problem, linking preference learning with probabilistic inference. The method updates the policy to increase the likelihood of preferred outputs relative to a reference model while decreasing the likelihood of less desirable ones.
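To make the classification view concrete, here is a minimal sketch of the per-pair DPO loss, -log sigmoid(beta * margin), where the margin is the policy's log-probability gain over a frozen reference model on the preferred response minus the same gain on the rejected one. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; all names are hypothetical).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss is log(2), a coin flip.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy prefers the chosen response more than the reference does: lower loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

Minimizing this loss over a dataset of preference pairs raises the relative likelihood of preferred responses, and beta controls how far the policy may drift from the reference.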
In machine learning, DPO offers an efficient framework for learning from comparative feedback rather than explicit labels. In deep learning, it has become a key technique for aligning large language models, improving helpfulness, safety, and instruction-following behavior. In reinforcement learning, DPO provides a bridge between supervised learning and policy optimization, replacing complex RL pipelines with a more stable optimization objective.
The broader insight is that many real-world tasks are easier to express through preferences than absolute rewards. By learning directly from comparisons, Direct Preference Optimization offers a scalable and mathematically elegant framework for training the next generation of aligned AI systems.
Image: https://t.co/9Ync1ZHNKq
100 Great Problems of Elementary Mathematics
"The collection, drawn from arithmetic, algebra, pure and algebraic geometry and astronomy, is extraordinarily interesting and attractive."
—Mathematical Gazette
Get it here: https://t.co/Q11yhy5HlX
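A one-sentence summary of almost every "AI + traditional discipline" paper: anywhere a decision has to be made, AI can make it, and even without inventing a new algorithm you can simply fit the traditional heuristics with a neural network.
Common keywords and title templates: Neural [traditional method] for [traditional field], E2E Differentiable [traditional method], Data-Driven NN for [traditional field], Learning-Based [traditional method]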