Status update: I've been on/off AI agents in the last few days and it is a verifiable truth that every day I didn't use agents, I was more productive. I still attribute that to how slow they are, and my own inability to multi-task efficiently. The magic is there but the slowness doesn't let it cross the threshold where they actually make me faster, and I still dislike the whole thinking paradigm.
About Bend2: honestly, the C/Metal compiler codebase is a clusterfuck right now. I regret letting AI agents write it. All tests pass, and GPU performance is mind-blowing, so the core architecture works. Yet, it has a LOT of bugs. Anything not covered by the tests is a coin toss. This is actually impressive, because, in many parts of the codebase, the right solution was actually the simplest one, yet, the agents STILL managed to find a way to make it work just for the tests. The level of reward hack these agents output is actually impressive I can't even be mad.
It is also ironical because that's the very problem that Bend's proof system was supposed to solve, but Bend is in TypeScript, not in Bend. I'm disappointed I didn't write Bend in itself, and now I feel an immense urge to do so. But the clock is ticking . . .
Still, I do not think Bend is worth launching without the GPU compiler being solid, because the closest competitor, Lean, is actually extremely good, so we need a big differential. Yet, due to the very nature of the project, it would be embarrassing to have bugs at launch.
Regarding AI, I now believe using current gen AI agents in production codebase is harmful and a massive mistake. That doesn't mean no agents at all, but agents work best when they don't touch critical code. Debugging, researching, providing insights, scripts / tools, or anything that doesn't touch code you will maintain in the long term. But if you merge AI code without reading, you're going to have a bad time. Speaking from experience
I'm working 10h/day on SupGen and the remaining time on Bend2
Vibe coded a fix for my broken sodastream in a few hours with Claude Desktop and the Autodesk Fusion 360 MCP.
Had to steer Opus 4.7 a lot but I didn’t touch the CAD at all.
The fix required some creative design because I couldn’t un-crimp the brass stopper from the wire so we did a clamshell screw thread with a retainer ring to snap fit around it.
The clamshell was Claude’s suggestion, but it kept overcomplicating the design and going down rabbit holes that weren’t gonna work.
The right level of promoting seems to be precise instructions with annotated pictures. Claude can handle measurements/orientation well, but messes up the order operations, uses the wrong tool, or hallucinates the result of its actions.
I’m impressed, but also realise just how little of the physical world these models understand and how clunky the OODA loop with this kind of problem is.
Still, great unlock for simple designs like this!
waiting-driven-development used to mean waiting for someone to build the thing, now it means waiting for models to get good enough to one-shot the thing.
Claim: gpt-5-pro can prove new interesting mathematics.
Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof it's correct.
Details below.
PSA if you have a downgrade scheduled for Claude, you have to cancel it before you can downgrade to a different tier. It just no-ops if you try to adjust plan after scheduling a downgrade.
The big unlock with GPT-5 is latency. I love Claude Code, but keep coming back to Cursor + GPT-5 because of how fast it responds: it keeps me in flow vs context switching while waiting for Claude.
Excited for this! I've tried extensive prompting to get Claude to be a socratic tutor, but a few minutes in it starts to give away answers as hints. The only reliable strategy is to append a socratic reminder to every prompt.
We launched a Claude Code learning mode!
Claude Code not only makes you more productive, but can now also help you get better at coding.
Whether you're a CS student or seasoned programmer, it will push you to think deeper about the code you're generating.
What's a good verb for replacing SaaS/subscriptions with local/self-hosted, purpose-built, DIY software? I've been saying subkill but probably something more catchy has been coined.
Now that I have (pretty accurate!) token counting in my statusline, I'm trying to see when the API actually stops my session. With auto-compact off, it starts warning 0% context around 160k tokens (80% context)
It died at 191.6k (95%)
input length and `max_tokens` exceed context limit: 198149 + 21333 > 200000
It seems they preserve a generous 40k tokens for compaction - in my profiling its usually <20k
If you want the extra headroom/to take your chances with not being able to continue the conversation, turn off auto-compact :)
Prompt:
Tutor me in <topic> with the Socratic method. Ask short, simple questions that make me think as much as possible, and don't give away any information. Challenge my assumptions, and lead me down a path to deriving the understanding. Ask me to try harder if my answers are weak. Only give clues if I ask for them.
GPT-5 is the first model to give a good experience experience of the Socratic method. It adapts well to instructions and maintains a consistent character throughout the interaction.