This blog by Nicholas Carlini is stellar: https://t.co/nqkalFzuDl
Internalizing things based on words is much more difficult to do than internalizing from (bad) experience, but if there is one place you should try hard to learn from as a researcher, it is this post.
@code_star @soldni @eliebakouch @Grad62304977 @samsja19 1. It's possible that only repository-level filtering is applied when training coding models (this was a choice by DeepSeek-Coder); in that case, they're most likely retaining the different branches. Let's take the example of commaai/openpilot (serving as a typical OSS repo here)
what surprised me was that even smaller models one-shotting through
I guess the hard part about making unverifiable domains verifiable isn't having a strong reasoner model to provide rewards?
@joel_bkr earlier I used to think synthetic data would break this logic, but it seems there are too many issues with collapse/bad distributions as we scale, so the above intuition still holds
@bilaltwovec I feel like this is the automated-AI-research equivalent of the "autocomplete phase" we saw in AI coding
not sure how long it might take to begin considering abstracting away the underlying research like we're thinking about the future of code right now
@khoomeik dataset distillation? I vividly remember those blurry images, each representative of an entire class; training on just 10 images gave great performance on imagenet
this was a very interesting direction back in the resnet days, wondering where it went
@MaziyarPanahi a bit tangential, but have you been using LLMs as judges to supervise the CoTs?
since CoT supervision would be the primary challenge in this situation
did synth data generation for the same task in Sept 2024 and today
fighting mode collapse was so hard back then and is completely absent now
we've come a long way, wondering whether it is only because models got larger or whether the labs actually got an improved data distribution
@sdathath this seems more aligned with the task of pure next-token generation, hence my suspicion that it is more influenced by changes in pre-training
@sdathath the reason being that I'm doing generation for a somewhat simple task
example: where the earlier models used to fill in names with "john doe" 7/10 times, the current ones give a really good diversity of names
and this observation goes for most of the peculiarities of the data I know of
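A rough way to put a number on the "john doe" effect is to track how often the most common value appears in a batch of generations. This is only a minimal sketch: the function name `name_diversity` and both sample lists are made up for illustration and are not the actual data from either run.

```python
from collections import Counter


def name_diversity(names):
    """Return (top_name, top_share, distinct_ratio) for a list of generated names.

    top_share is the fraction of samples taken by the most common name;
    distinct_ratio is the number of unique names divided by the sample count.
    """
    if not names:
        raise ValueError("names must be non-empty")
    # Normalize case and whitespace so "John Doe" and "john doe " count as one name.
    normalized = [n.strip().lower() for n in names]
    counts = Counter(normalized)
    top_name, top_count = counts.most_common(1)[0]
    return top_name, top_count / len(normalized), len(counts) / len(normalized)


# Hypothetical samples, invented for illustration only.
collapsed = ["John Doe"] * 7 + ["Jane Smith", "Alex Lee", "Sam Park"]
diverse = [
    "Amara Okafor", "Liu Wei", "Sofia Rossi", "Jamal Carter", "Priya Nair",
    "Mateo Garcia", "Hana Sato", "Elena Petrova", "Noah Kim", "Fatima Zahra",
]

print(name_diversity(collapsed))  # ('john doe', 0.7, 0.4)
print(name_diversity(diverse))
```

A high top share with a low distinct ratio is the collapsed pattern; the same check can be run on any other field that tends to collapse.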