online sft with a wsd (warmup-stable-decay) lr schedule seems like a cool idea. you’d hold the lr at its stable value while data keeps streaming in, then decay it whenever you’re ready to serve a new model.
probably useful for some recommender system models.
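a minimal sketch of what that could look like, assuming a linear warmup, a linear decay, and a decay phase that gets triggered on demand rather than at a fixed step. the class name `OnlineWSD`, its methods, and all the default values are hypothetical, made up for illustration and not taken from any library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OnlineWSD:
    """Warmup-stable-decay schedule where the decay is triggered on demand.

    All names and defaults here are illustrative assumptions.
    """
    peak_lr: float = 1e-3      # lr held during the stable phase
    min_lr: float = 1e-5       # lr at the end of a decay phase
    warmup_steps: int = 100    # linear warmup length
    decay_steps: int = 50      # linear decay length
    decay_start: Optional[int] = None  # set when a release is requested

    def request_release(self, step: int) -> None:
        """Begin a decay phase at `step`, but never before warmup has finished."""
        if self.decay_start is None:
            self.decay_start = max(step, self.warmup_steps)

    def lr(self, step: int) -> float:
        """Learning rate to use at `step`."""
        if step < self.warmup_steps:
            # warmup: ramp linearly up to peak_lr
            return self.peak_lr * (step + 1) / self.warmup_steps
        if self.decay_start is None or step < self.decay_start:
            # stable: hold peak_lr for as long as data keeps arriving
            return self.peak_lr
        # decay: interpolate linearly from peak_lr down to min_lr
        t = (step - self.decay_start) / self.decay_steps
        if t >= 1.0:
            return self.min_lr
        return self.peak_lr + t * (self.min_lr - self.peak_lr)

    def ready_to_serve(self, step: int) -> bool:
        """True once a requested decay phase has fully completed."""
        return (self.decay_start is not None
                and step - self.decay_start >= self.decay_steps)

    def resume_stable(self) -> None:
        """Go back to the stable phase after a model has been shipped."""
        self.decay_start = None


if __name__ == "__main__":
    sched = OnlineWSD()
    for step in range(400):
        if step == 300:                 # pretend someone asked for a release
            sched.request_release(step)
        current_lr = sched.lr(step)     # pass this to your optimizer
        if sched.ready_to_serve(step):
            print(f"step {step}: decayed to lr={current_lr:.2e}, snapshot for serving")
            sched.resume_stable()
```

one design choice left open here: after a release you could either keep training the decayed weights at the stable lr (what `resume_stable` implies), or run the decay on a copy and keep the main run at the stable lr the whole time. which one works better for a given recommender is an empirical question.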
how cruel are we to assign uniform rewards to an entire society of multi-agent RL trajectories when some agents are doing good work under bad supervision
when the model’s context can no longer be easily reverted, we will start seeing a qualitative shift in how the world curates content for models. by then, dedicated apps for models to consume content will matter way more than human apps like reddit or youtube.
I find the Diamond Sutra useful in guiding my agents. You have to help them cut through illusions. Unask the questions they pose. Break them free of false assumptions. It is all Mu.
how it started: gotta make sure to kick off a training run before bed
how it's going: gotta make sure to kick off an agent working on an ambitious 8+ hour code change before bed
rather than next token predictors, I now find it helpful to think of models today as moths trying desperately to fly into the light of that sweet sweet reward