Since everyone is talking about RL environments and GRPO now, but no one knows how they work, we thought it would be cool to make an explainer video + code you can run:
This is an example of using GRPO to train Qwen 2.5 to play 2048 (code in thread) 🧵:
OPENAI: "We also see early signs of recursive self-improvement in today's systems". RSI is "potentially the most consequential frontier safety issue of the coming decade."
@redtachyon > tries to imagine AGI
> fails
> calls everyone else talking about it a moron
yeah guys it's not a real thing, it won't happen for 1000 years. pack it up guys.
For some reason property rights exist for atoms but they don't quite exist for photons*.
AI companies exploit this fact as much as they exploit scaling laws.