Life update 🚀
I started hill climbing my X follower count to receive @elonmusk's payout. I must say Elon is very generous!
Bear with me if I start posting crazy things. Opinions are my own.
reading tech reports, I feel:
1. many strategies work for LM training
2. any particular strategy that worked feels like it may have come down to details of their setup / tuning
these are typically supplied with post-hoc justification of their choices
Very clever, and I'm excited to see it work out well. On-policy self-distillation is perhaps the most efficient way to learn from hindsight in a multi-turn agent setup.
With standard SFT distillation, it would require continuing from a snapshot of the trajectory and container, which adds a lot of complexity overall.
On-Policy Distillation is the most active new research direction being explored in RL for LLMs. Had the chance to discuss how it works with Dwarkesh and why it fits so nicely into large-scale pipelines.
At first look, the whole pipeline seems very clean.
It has very little bootstrapping from existing LLMs, which is very different from the Nemotron approach.
I bet this model will smell very unique and "raw", maybe something like DeepSeek R1
This is just deploying hosted static sites though? Claude web, Google AI studio can already do this.
Maybe the novelty comes from achieving this with a desktop app rather than the web, which makes sense.
What's cooler? Kimi Agent mode can build and deploy full-fledged full-stack apps!
https://t.co/Ye510WJFNj
Building apps has never been easier.
With Sites, Codex can turn your work, ideas, and plans into an interactive website or app your team can explore, use, and share with a URL.
Rolling out to Business and Enterprise plans, before expanding more broadly.
The best motivation for opsd is perhaps pushing the boundary (otherwise, on-policy rl gives a fresher and more "on-policy" model)
then in that sense, a good opsd run should beat the peak rl checkpoint in a domain.
it is acceptable to me if opsd pushes the boundary but collapses eventually. many successful rl runs collapse in the end anyway