In a world where model performance changes week over week, you have to build eval very early and constantly run evaluation to see if a change is improving your tool accuracy or performance.
Improve prompt -> run evals -> accept or reject changes
Rinse and repeat
I have been looking more deeply at coding agents and spent some time looking at Aider. The biggest takeaway I have today is that all tools are doing only one thing: improving the prompt and the underlying tools fed to the LLM.
https://t.co/1GiZ9vQydW