Here are some of the experiments and observations I made as one of the initial testers of the locksmith game within ARC-AGI-3 (my template is available in the repository) 🧵
Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI
We're releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API
Starting scores - Frontier AI: 0%, Humans: 100%
Codex and Claude just became a social hangout.
Messages, see which friends are online, chat rooms, leaderboards
good ol' ICQ / IRC chat vibes, inside Codex and Claude
join -> shellbook . co
Say hi to Poke Ultra!
The ultimate Poke experience, with frontier intelligence powering every action.
Also, starting today:
Poke can now create & deploy websites on
https://t.co/mCmmrugt1t!
@adithya_s_k Currently scaled this to 10000+ tasks with some dataset sources.
Trial run is a bitch here. Without non-refusal models, half of these are just wastage.
Happy to announce RewardHackBench, built on @harborframework. We study whether sandboxes can stop agents from cheating on benchmarks
https://t.co/YT4Ibq7tE9