So what should we do instead? It turns out simple n-gram models might be a better choice, thanks to incredibly fast inference speeds.
Sometimes, simpler is better!
4/n
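For anyone wondering what an n-gram drafter looks like in practice, here is a minimal sketch, assuming a bigram table and greedy drafting. The names `train_bigram` and `draft` and the toy corpus are hypothetical illustrations, not taken from the thread or from any paper. The large target model would then verify the proposed tokens in a single forward pass.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def draft(table, context, k):
    """Greedily propose up to k tokens after the last context token.

    Each step is a dictionary lookup, which is why drafting is so cheap.
    Stops early if the current token was never seen with a successor.
    """
    proposal = []
    last = context[-1]
    for _ in range(k):
        if last not in table:
            break
        last = table[last].most_common(1)[0][0]
        proposal.append(last)
    return proposal

# Toy corpus (hypothetical); a real drafter would be trained on far more text.
corpus = "the cat sat on the mat and the cat slept".split()
table = train_bigram(corpus)
print(draft(table, ["on", "the"], 3))
```

A real drafter would likely use longer contexts and backoff, but the cost profile is the same: no matrix multiplies at draft time.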
Distillation is a common way to improve acceptance rates, but we find that distillation on one task (translation) tends to generalize poorly to another task (story generation) in the language
3/n
the God Model is a useful theoretical construct akin to a Worst-Case Adversary or a Busy Beaver Program or an NP Oracle, less compelling as a target to seek than as a foil for designing minimax programs which can be tangibly realized
@joseph_h_garvin@buildwithparas In my experience they’re actually pretty bad at it—but I think that’s an effect of RLHF and style training; base models are way better
@xboxbodywash My favorite was a student who couldn’t remember how to break out of a loop to end the program, so instead they entered an infinite inner loop
@gabriberton Yup, even if you have task-specific data for distillation. It turns out that since the forward pass is orders of magnitude faster, you get a favorable speedup even though the acceptance rate is a lot lower
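To make the trade-off concrete, here is a rough sketch using the standard speculative decoding analysis, which assumes each drafted token is accepted independently with probability `alpha`. With `gamma` drafted tokens per step and a drafter that costs a fraction `c` of one target forward pass, the expected speedup is (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma * c + 1)). The specific numbers below are made up for illustration, not measurements from the thread.

```python
def expected_speedup(alpha, gamma, c):
    """Expected wall-clock speedup over plain autoregressive decoding.

    alpha: per-token acceptance rate (assumed i.i.d., 0 <= alpha < 1)
    gamma: number of tokens drafted per verification step
    c:     cost of one drafter pass relative to one target pass
    """
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_step = gamma * c + 1
    return tokens_per_step / cost_per_step

# Hypothetical numbers: a distilled neural drafter with a higher acceptance
# rate but noticeable cost, versus an n-gram drafter that is nearly free.
neural = expected_speedup(alpha=0.7, gamma=4, c=0.2)
ngram = expected_speedup(alpha=0.5, gamma=4, c=0.001)
print(f"neural drafter: {neural:.2f}x, n-gram drafter: {ngram:.2f}x")
```

Under these assumed values the n-gram drafter comes out ahead despite the lower acceptance rate, because its cost term `gamma * c` is close to zero.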
@gabriberton Apple does for some of the on-device tiny models
I also actually just finished a paper (arXiv soon) showing that n-gram models consistently work better for rare languages
Most people against this are foreigners who are already here, mad that they're about to lose their jobs to Americans. This will make it more appealing to hire American citizens and American college grads.
The Americans against it are culturally suicidal drags