This SkillOpt paper from Microsoft is a must-read!
(bookmark it)
I was a bit skeptical of the results reported in the paper when I shared it a few days ago.
However, I managed to integrate it into my agent orchestrator and ran a few experiments.
The results are mind-blowing.
Essentially, all my agent skills now have a proper testing framework and a way to self-evolve. I have started to improve all my agent skills with this.
One exciting result came from applying it to my paper-figure-extraction skill, which requires an agent to do multimodal analysis. It improved quality by 20 points (0.73 → 0.93). I went through the extracted tables and figures, and I was absolutely stunned by how much better my skill got at the task.
Self-improving AI is in its early days, but I think this work is a clear example of the current ability of agents to self-improve.
In this case, it was skills, but it's not hard to imagine how this scales to optimizing agent patterns, tool use, context engineering efforts, agentic search, workflows, evals, and even the harness itself. I already started with a few of these ideas inspired by SkillOpt.
Stay tuned!
A pattern I'm seeing with AI debugging: it's easy to get stuck inside the model's search space.
So you burn tokens & time chasing candidate fixes, while the real answer sits in context only you have ... but never explored, because you quietly surrendered your thinking.
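If performance varies with the harness, then RAG methods should be evaluated across harnesses when you compare them.
The vector search used in the evaluation is fairly classical and the overall coverage isn't great, but it's still an important point.
RAG evaluation in the agent era: "Is Grep All You Need?"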
https://t.co/vEFcAJh3R3
You asked for it, so here it is: a deep-dive on my new /handoff skill.
It's an alternative to /compact that gives you WAY more flexibility with your context window.
- Think of an idea, hand off to another agent to implement
- Grill, hand off to prototype, hand off BACK
Enjoy: