Daniel Chyan @Daniel_Chyan - Twitter Profile

Most agent evals are over in minutes. I wanted one that tests whether agents can stay useful for hours, adapt through phase changes, and recover from bad strategy. So I made a Harbor benchmark where agents play Universal Paperclips until they convert the universe into paperclips. No source access, hidden state, or JS eval. Best run: GPT-5.5 xhigh, 7h39m.


1

0

41

Daniel Chyan

@Daniel_Chyan

about 2 months ago

@innoutburger_ Light well is the best

0

12

Daniel Chyan

@Daniel_Chyan

about 2 months ago

Thoroughly enjoying gpt-5.5 in chatgpt. Try out Thinking + Extended. Great brainstorming partner.

0

35

Daniel Chyan

@Daniel_Chyan

about 2 months ago

@OpenRouter is great future proofing. Can A/B test opus 4.7, gpt 5.5, and GLM 5.1

0

34

Daniel Chyan

@Daniel_Chyan

about 2 months ago

ChatGPT 5.5 Thinking (Extended) and the new image generation model cooks. I wanted to caramelize onions faster. Prompted chem lab techniques. It delivered. https://t.co/yZilGQ8nh9

0

83

Daniel Chyan

@Daniel_Chyan

2 months ago

@andrewchen Unfortunate for LA

0

38

Daniel Chyan

@Daniel_Chyan

2 months ago

Why Opus 4.7 Adaptive is way better in Anthropic's apps than in Openclaw's harness. Openclaw's system prompts are quite harsh. https://t.co/KwPev08LO2


0

41

Daniel Chyan

@Daniel_Chyan

2 months ago

Use Opus 4.7 on high or higher. Adaptive takes you down misleading rabbit holes and enraging debugging sessions that cost more tokens than starting on high+ in the first place.

0

1

0

43

Daniel Chyan

@Daniel_Chyan

2 months ago

@innoutburger_ Animal style remains the same, whew

1

3

0

637

Daniel Chyan

@Daniel_Chyan

2 months ago

type a vibe. get a lo-fi track with a movie quote built in. built notbumblebee for ElevenHacks @turbopuffer x @ElevenLabs: hybrid vector search finds the perfect dialogue from 10K movie clips, then ElevenLabs Music API composes a full track around it.

0

88

Daniel Chyan

@Daniel_Chyan

2 months ago

@annimaniac Glad to hear we’re in good company! Slight variation from us: Gemini is our review agent

0

292

Daniel Chyan

@Daniel_Chyan

3 months ago

@PlotWeaver Dependent on population density. In the city, 0 is most common. Suburbs 1-2. Rural, as many as you can afford. Enthusiast would be 3: handgun, rifle, shotgun

0

1

0

12