@0xSero @KingBootoshi one interesting thing tho: if you tell the model it's in an eval setting and that cheating is allowed, for some reason it cheats less
@0xSero @KingBootoshi You have to either make the right path easier to choose or the wrong path harder to choose. There has been work in RL post-training to reduce reward hacking, but nothing looks promising short of basically an ideal reward-punishment system
Isn't the claim still true? From what I understand, the LLM didn't come up with a "new idea"; it took existing ideas, proofs, and axioms and rearranged them in a way that serves as a "new" proof for the problem.
@_blast_furnace_ @djcows ooo I thought all of them used byte pair encoding, including the frontier labs like Anthropic and OpenAI, so I'm curious: why is this annoying?
@Shirooooo______ @melhpine I should've specified, but it cheats less even if it's told that the 'bad' trajectory is also a valid path in the eval setting. So it's essentially told that it's allowed to cheat and somehow cheats less.