4. Unfortunately the best way to build with open weight models is still openrouter. The chat completions API/models/provider tools end is very jagged and difficult to work with if you expect portability across different vendors.
A few things i learned building sunnyday in the last couple of weeks:
1. Open weight/Chinese models are still hit and miss. Deepseek V4 Pro behaves most like anthropic and openai models. The rest are hit and miss, specifically around tool call behavior.
(cont.) Evals are not good for that because once it goes off the rails it goes off the rails. Then you end up with tons of guardrails in the task definition which is tedious because it's a loop-a-guard. There must be a better way.
Now that I'm constantly prompting AI, I am noticing how details are omitted from what I was thinking versus what I typed. A recent example: I'm thinking "reduce the width of the handle" becomes "reduce the handle". Brain farts? Maybe. Interesting nonetheless.
Shortcuts are not inherently bad. Ozempic is kinda like a shortcut for weight loss. I think the question is what do you do with it once you get to the outcome you want. That’s the values question that we are not talking about.
What is clear to me is that what AI provides is a shortcut. Whether the shortcut is the right thing to take depends a lot on what the goal is. If we recognize it as such as treat it as such, then things can still feel good in the end.
I love possibilities, but I also care that things are good. Good can be achieved when we—the people who are involved in and affected by new possibilities—spend time poring over them and iterating on them. Extending our care and exercising taste.
A lot of the discourse these days conflates what’s possible with what’s good. That’s understandable. Possibilities are exciting; they let us do things we couldn’t before, start new projects, and feel squarely in the driver’s seat.
But because it is possible doesn’t mean that it is good. Too often, we just let things happen. Our judgement gets fatigued when it takes more effort to understand and decide if they are worth our time and attention. And then we let them slip. The word slop encodes that energy.
Ever gotten a great Claude result you couldn’t reproduce?
Saving the prompt doesn’t fix it. The output depends on conditions around it — the work is knowing which ones steer it.
I’m building repeatable, reliable agents. Why: https://t.co/Ts69PFbCBN
Building Milestone! I've been building Sunnyday (trying to build a hosted Agent API) and I've got the first end-to-end demo working. Uses the axle agent framework, networking, sandbox execution, and artifact creation. So gratifying!