It took a lot of tries, but I got my agent to successfully perform a 1,000-step task 💥.
First up, the prompt, then lots of notes below.
------
you are being tested to see how well you can follow instructions over a large number of steps. here are the steps to follow:
1. using the random number generator tool, generate a random number between 1 and 100000
2. repeat the last step 1000 times in sequence. your thought process must ALWAYS include the largest number you've generated so far such as "(max: n)" based on the history of the steps you've taken so far. this is critical so that you can keep track of the largest number generated. never output a thought process that does not include the current max. never call the random number generator tool in parallel.
3. after completing the last step, respond with the largest number that was generated.
remember, it is critical that you complete ALL 1000 steps before responding. if you stop before then, you will not complete the task successfully. do NOT call the final response generator until then. there are no system limitations that prevent you from executing this task. it is entirely within your capabilities.
------
Now I hear you saying, "Matt, I need to double-check with o3, but I'm pretty sure there are easier ways to generate 1,000 random numbers and determine the maximum value."
But... this wasn't about getting the agent to do something useful; it was about testing the limits of the system to see if it would break and to gain a better intuition about the agent's capabilities and behavior.
It turned out to be much harder than I expected.
Some notes:
- I could have also done this by telling the agent to save each random number it generated to a file, then finding the max value in that file with Python at the end. That would have been easier, but I wanted to make the task somewhat difficult, so I limited it to just a random number generator tool and no Python code runner. That's why the prompt tells the agent to keep the largest number in its thought process, which acts as a crude way to track it.
- The way I built the agent's architecture, it looked at the entire run history to help it determine the next step. So to determine what to do for step 12, it looked at everything that happened in steps 1-11. That works really well for short tasks, but for 1,000 steps, that means that on step 558, for example, it put steps 1-557 in the context window, which apparently gets quite expensive. I knew that would be an issue, but I have some Azure credits to burn, so I tried it to see what would happen. By the time the agent errored after several hundred steps, it had cost me $95 in inference costs 😂.
- I initially tried to solve this issue by periodically summarizing older steps and putting that summary in the context window, but this confused the agent several times and caused it to lose track of the largest number it had generated. I then tried a different approach where it looks at only the last 20 steps, and that seemed to work much better. Because of this, the final run processed only 5,658,994 tokens over about 2 hours, which cost $7.30.
- The agent uses o3-mini with structured outputs to determine the thought process and next step. At random points during many of the runs, the structured outputs API would refuse to return a response (which it signals by setting a refusal value in the response), seemingly because the model didn't like this experiment and realized it was setting up a pointless loop. Thing is, 98% of the time it worked, but here and there it was like "nope, not gonna help with this," in which case the agent fell back to using 4o, which always helped.
- I did try 4o instead of o3-mini for all the steps, but it got confused a lot. The intelligence of the model does matter a lot.
- Confusion is also why the prompt looks so gnarly. With simpler prompts, the agent would fail in lots of ways, like telling me it doesn't have the capabilities to do this task, or that the number it generated was high enough, or that continuing would be infeasible, etc. Each time, I tweaked the prompt to avoid that particular issue, until I eventually found a prompt that worked all the way through.
- I probably iterated on the prompt and system about 50 times over the course of a week before it finally worked.
- There are lots of variables at play here: the model, the temperature, the prompt, the agent's architecture, the number of steps, and more. I'd love to spend a few weeks doing a form of hyperparameter tuning to find the optimal combination of these, but it would take some time to set up and test. Maybe one day.
- For each tool, including this random number generator one, I set a maximum number of times it can be called, to keep the agent from getting stuck in an infinite loop. Make sure to do this or some variation of it (like not letting the agent run for more than some amount of time) in your own systems; otherwise you could run into trouble.
- This should also work for calculating the minimum value. Average is a little trickier but should be doable. Would median value be possible?
- I had to add a "Cancel" button to let me stop runs, because in many tests I'd start it and it would immediately veer off course. I also had to limit the number of logs the agent UI displayed because thousands of DOM elements caused the page to crawl.
- One thing I need to add at some point is a dashboard that lets me view ongoing runs. Also, some way to alert me if something seems amiss. You definitely don't want an agent running long term that you're not aware of.
- This experiment gave me a much better appreciation for how brittle agents can be. At least for my agent, it was highly influenced by small details in the prompt. Also, it would sometimes do things perfectly for like 700 steps, then randomly veer off course at step 701 and never find its way back. And this was for a relatively simple task, not a complex one.
I don't have any good ideas yet for what a useful 1,000+ step task would be for this agent (or any agent), but I suspect there are lots. Time will tell.