In AI, there is no active or passive: only ruthless information compression and algorithmic efficiency, stripping perception down to its most fundamental core!
One of the hottest terms in AI right now is "On-policy distillation".
It is a post-training technique in which a student model, typically an LLM, samples from its own current policy and receives a teacher signal on the states it actually visits. It combines the dense supervision of distillation with the locality of online RL.
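A minimal sketch of that loop, under stated assumptions: the student and teacher are toy functions that map a token prefix to a next-token distribution, and the teacher signal is a per-step reverse KL, KL(student || teacher). Reverse KL is one common choice and is assumed here; the post does not specify the loss. All names (`on_policy_distill_losses`, `toy_student`, `toy_teacher`) are hypothetical.

```python
import math
import random


def sample(probs, rng):
    """Draw a token index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def on_policy_distill_losses(student, teacher, prompt, steps, rng):
    """Roll out the student; score every visited state with the teacher."""
    state = tuple(prompt)
    losses = []
    for _ in range(steps):
        s_probs = student(state)
        t_probs = teacher(state)  # teacher is queried on the student's own state
        losses.append(reverse_kl(s_probs, t_probs))
        # The next state comes from the student's sample, not from teacher data.
        state = state + (sample(s_probs, rng),)
    return state, losses


# Toy policies over a 3-token vocabulary (assumed for illustration only).
def toy_student(state):
    return [0.5, 0.3, 0.2]


def toy_teacher(state):
    return [0.7, 0.2, 0.1]


if __name__ == "__main__":
    rollout, step_losses = on_policy_distill_losses(
        toy_student, toy_teacher, prompt=[0], steps=4, rng=random.Random(0)
    )
    print(rollout, step_losses)
```

In a real setup the per-step losses would be averaged and backpropagated through the student only. The point of the sketch is that every state the teacher scores was reached by the student itself, and every step gets a signal.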
Now a method on PapersWithCode!
Find all 183 papers that cite it, and more here: https://t.co/NIsUjyU3UP
EVE-Agent argues that self-evolving search agents should not train on examples they cannot justify. Data-free self-evolving search agents generate their own questions, answer them, and improve from their own feedback. That scales beautifully without human annotations, but it also lets the agent reward fluent-but-unsupported examples, turning the self-generated curriculum into an opaque and unreliable training signal.
EVE-Agent's requirement: each generated instance must include not just an answer but a source-grounded span whose contribution to that answer can be measured.
Mechanism: the proposer generates a question, an answer, AND a verbatim evidence span. An evidence verifier rewards the span according to the marginal accuracy gain when that evidence is provided to the solver. Spans that genuinely raise accuracy get reinforced. Hallucinated or irrelevant spans get filtered out before they pollute the gradient.
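The reward described above can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the solver is a hypothetical callable `solver(question, evidence)`, accuracy is estimated by repeated sampling, the verbatim check is a plain substring test against the source text, and the filter threshold is an assumed parameter.

```python
def accuracy(solver, question, answer, evidence, n_samples):
    """Fraction of solver attempts that match the proposed answer."""
    hits = sum(solver(question, evidence) == answer for _ in range(n_samples))
    return hits / n_samples


def evidence_reward(solver, question, answer, span, source, n_samples=8):
    """Marginal accuracy gain from showing the solver the evidence span."""
    if span not in source:  # assumed check: the span must be verbatim
        return 0.0
    with_evidence = accuracy(solver, question, answer, span, n_samples)
    without_evidence = accuracy(solver, question, answer, None, n_samples)
    return with_evidence - without_evidence


def filter_instances(instances, solver, source, threshold=0.0):
    """Keep only instances whose span measurably helps the solver."""
    return [
        inst
        for inst in instances
        if evidence_reward(
            solver, inst["question"], inst["answer"], inst["span"], source
        ) > threshold
    ]


# Toy setup (assumed for illustration): the solver only answers correctly
# when the evidence it is shown mentions the answer.
def toy_solver(question, evidence):
    return "Paris" if evidence and "Paris" in evidence else "unknown"


SOURCE = "Paris is the capital of France. The Seine flows through it."
QUESTION = "What is the capital of France?"
INSTANCES = [
    {"question": QUESTION, "answer": "Paris",
     "span": "Paris is the capital of France."},        # grounded and useful
    {"question": QUESTION, "answer": "Paris",
     "span": "The Seine flows through it."},            # verbatim but irrelevant
    {"question": QUESTION, "answer": "Paris",
     "span": "Paris has been the capital since 987."},  # not in the source
]

if __name__ == "__main__":
    kept = filter_instances(INSTANCES, toy_solver, SOURCE)
    print([inst["span"] for inst in kept])
```

Only the first instance survives the filter here: the irrelevant span yields no accuracy gain, and the hallucinated span fails the verbatim check before the solver is consulted.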
This is the same insight as RLVR, applied one level earlier in the self-evolving loop: validate the data BEFORE it becomes a gradient. If you train on agent self-play, this is an architectural change worth adopting.
EVE-Agent: Evidence-Verifiable Self-Evolving Agents
https://t.co/stpPTQum1o