Most agent evals are over in minutes.
I wanted one that tests whether agents can stay useful for hours, adapt through phase changes, and recover from bad strategy.
So I made a Harbor benchmark where agents play Universal Paperclips until they convert the universe into paperclips.
No source access, hidden state, or JS eval. Best run: GPT-5.5 xhigh, 7h39m.
ChatGPT 5.5 Thinking (Extended) and the new image generation model cook.
I wanted to caramelize onions faster. Prompted chem lab techniques. It delivered.
Use Opus 4.7 on high or higher. Adaptive takes you down misleading rabbit holes and enraging debugging sessions that cost more tokens than starting on high+ in the first place.
type a vibe. get a lo-fi track with a movie quote built in.
built notbumblebee for ElevenHacks @turbopuffer x @ElevenLabs: hybrid vector search finds the perfect dialogue from 10K movie clips, then ElevenLabs Music API composes a full track around it.
@PlotWeaver Dependent on population density. In the city, 0 is most common. Suburbs 1-2. Rural, as many as you can afford.
Enthusiast would be 3: handgun, rifle, shotgun