you can train lots of insanely cool things
you can train it to play a Minecraft sim, where it collects resources and crafts items
@puffer_ai is just insanely cool
Seeing a number of benchmarks showing Opus is the best model for long-running work.
Five tips for running Opus autonomously for hours/days:
1. Use auto mode for permissions, so Claude doesn’t ask for approval
2. Use dynamic workflows, to have Claude orchestrate hundreds/thousands of agents to get a task done
3. Use /goal or /loop, to nudge Claude to keep going until it’s done
4. Use Claude Code in the cloud, so you can close your laptop (easiest way is the desktop or mobile app)
5. Make sure Claude has a way to self-verify its work end to end: Claude in Chrome browser extension for web, iOS/Android sim MCP for mobile, a way to start the full web server or service for backend work
Can coding agents stay coherent over a 1 billion token budget?
Can they build Slack from scratch?
Rewrite a JAX codebase in PyTorch?
Build a C compiler in Rust?
Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
this isn’t talked about enough cause a lot of people are hoping the big labs will solve this and it will be available to them downstream.
in my opinion there will be a lot of different styles of continual learning (TTT, OPD, etc.). different problems will require different styles.
In Agent RL, models suffer from Template Collapse.
They generate vast, diverse outputs (High Entropy) that lose all meaningful connection to the input prompt (Low Mutual Information).
In other words, agents learn different ways to say nothing.
🚀 Introducing RAGEN-v2 -- Here's how we define and fix such silent failure modes in Agent RL. 🧵
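To make the "high entropy, low mutual information" claim concrete, here is a toy sketch in Python. It is not RAGEN-v2's metric or code: it assumes prompts and outputs can be treated as discrete labels and uses a simple plug-in estimate, I(X;Y) = H(X) + H(Y) - H(X,Y). The function names and the two sample datasets are hypothetical, made up for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def output_entropy(pairs):
    """H(Y): how diverse the outputs are, ignoring the prompts."""
    return entropy(Counter(y for _, y in pairs))


def mutual_information(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) over (prompt, output) pairs."""
    hx = entropy(Counter(x for x, _ in pairs))
    hy = entropy(Counter(y for _, y in pairs))
    hxy = entropy(Counter(pairs))
    return hx + hy - hxy


# Hypothetical "collapsed" policy: every prompt gets the same spread of outputs,
# so outputs are diverse but carry no information about the prompt.
collapsed = [(p, o) for p in ("p1", "p2") for o in ("a", "b", "c", "d")]

# Hypothetical "grounded" policy: equally diverse outputs, but each prompt
# maps to its own outputs.
grounded = [("p1", "a"), ("p1", "b"), ("p2", "c"), ("p2", "d")] * 2

for name, data in (("collapsed", collapsed), ("grounded", grounded)):
    print(name, output_entropy(data), mutual_information(data))
```

Both toy policies have the same output entropy, so entropy alone cannot tell them apart; only the mutual information term separates them.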
done. cool bit: interpolating between RL checkpoints (souping), which I didn't know was a thing before. They also show you can extrapolate beyond what training ever reached.
I was hoping they had some results on the divergence between the souped checkpoints and the actual ones.
🧵 For 2 RL checkpoints trained differently, you can just extrapolate their weights and it works!
Bonus: these extrapolated checkpoints are complementary policies
-> Get exploration and diversity for free
-> Better inference scaling when ensembling
Paper: https://t.co/zU0LH0TOdm
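A minimal sketch of what weight interpolation and extrapolation between two checkpoints could look like, in Python. The posts above do not give the paper's exact formula, so this assumes the common linear form theta = theta_a + alpha * (theta_b - theta_a), where 0 <= alpha <= 1 interpolates (souping) and alpha outside that range extrapolates. Checkpoints are represented as plain dicts of float lists; `blend_checkpoints` and the parameter names are hypothetical.

```python
def blend_checkpoints(ckpt_a, ckpt_b, alpha):
    """Linearly combine two checkpoints with matching parameter names and shapes.

    alpha = 0 returns ckpt_a, alpha = 1 returns ckpt_b,
    0 < alpha < 1 interpolates, alpha < 0 or alpha > 1 extrapolates.
    """
    if ckpt_a.keys() != ckpt_b.keys():
        raise ValueError("checkpoints must have the same parameter names")
    blended = {}
    for name in ckpt_a:
        a_vals, b_vals = ckpt_a[name], ckpt_b[name]
        if len(a_vals) != len(b_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [a + alpha * (b - a) for a, b in zip(a_vals, b_vals)]
    return blended


# Toy checkpoints (assumed values, for illustration only).
ckpt_a = {"w": [0.0, 2.0]}
ckpt_b = {"w": [1.0, 4.0]}

soup = blend_checkpoints(ckpt_a, ckpt_b, 0.5)          # midpoint between the two
extrapolated = blend_checkpoints(ckpt_a, ckpt_b, 1.5)  # past ckpt_b along the same direction
print(soup, extrapolated)
```

With real models the same arithmetic would be applied per tensor in each checkpoint's state dict; whether a given alpha helps is an empirical question this sketch does not answer.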