Introducing Horizon from @0rinlabs: the first long-horizon learning benchmark built from real agent logs
- SOTA is 21% on the hardest section
- 7-35M tokens of real agent history per task
- Models are barely improving on the hardest tasks
- Humans can score 100%
(1/7)