@RhysSullivan Depends on the model usually no but in our benchmarks there are models where it helps a ton but they usually have weaker reasoning capabilities
Human-in-the-loop RL is necessarily done at group size 1; you cannot do a group of rollouts with only one human. i.e. there is no baseline for you to subtract for each input prompt. This is by far the most interesting and under-discussed part of this announcement.
The same was true for their tab-completions model. From the wording in their posts, it sounds like they are using plain REINFORCE (no mention of value functions) with a large batch size + re-evaluating each checkpoint to guard against high variance. Cursor is implicitly revealing an important empirical result: with a large enough batch size, simple REINFORCE just works, no baseline needed. In other words, large scale continual learning is solved.
It finally happened-my personal move 37 or more. I am deeply impressed. The solution is very nice, clean, and feels almost human. While testing new models in the last few weeks, I felt this coming, but it's an eerie feeling to see an algorithm solve a task one has curated for about 20 years. But at least I have gained a tool that understands my idea on par with the top experts in the field. And I am now working on a completely new level. My singularity has just happened… and there is life on the other side, off to infinity!
5 million humanoid robots working 24/7 can build Manhattan in ~6 months. now just imagine what the world looks like when we have 10 billion of them by 2045. now imagine the year 2100.
Introducing Lab: A full-stack platform for training your own agentic models
Build, evaluate and train on your own environments at scale without managing the underlying infrastructure.
Giving everyone their own frontier AI lab.
Introducing Lab: A full-stack platform for training your own agentic models
Build, evaluate and train on your own environments at scale without managing the underlying infrastructure.
Giving everyone their own frontier AI lab.
@thdxr I believe this is regarding intra-turn prefill where you precondition the response by proving the first few tokens for that turn of the assistant response
I believe this is unrelated to prior turns in the chat format