IMO, unsupervised coding agents are still in the "uncanny valley" of software development.
Given the recent hype around AI writing "a browser with 1 million lines of code from scratch", I have been giving agents more open-ended tasks, and every time they delivered seemingly working software with large flaws in the implementation.
Last week it was a "download manager in Rust" that completely failed at concurrent downloads. This week it was a port of the Ryu algorithm for pretty-printing floating-point numbers with different precisions (f32/f16/bf16/f8). Both times the software seemed to work but had deal-breaker bugs lurking. Fixing those required me to become part of the loop and supervise the agent.
Here is a breakdown of the latest experiment, for those who may be interested. TL;DR: https://t.co/bIHXXjHMZh (it says 66k lines added, but 65.5k of those are a generated fixture file, so the diff is more like +800/-300).
---
The goal was to implement pretty-printing of floating-point numbers (f32/f16/bf16/f8) in Numerical Elixir. I created a blank repository, wrote the problem statement, and mentioned I specifically wanted the Ryu algorithm, linking to a reference implementation in Erlang (https://t.co/388CdYIMW2) and to the paper (https://t.co/VGvtjlvP79).
I made one attempt with Sonnet and another with Opus. Both delivered a project with a passing test suite, and both implementations were wrong and incomplete.
In the first attempt, many of the tests were fabricated to match the faulty implementation. With wrong code and wrong tests, there was not much to salvage. Time to start over.
In the second attempt, now with Opus, I suggested it could generate all possible printable values for f16 from the canonical Ryu implementation in C (since there are only 65k of them), and use that to validate the algorithm. Once again, it delivered a passing suite, but it made one crucial mistake early on: when generating the reference table, it cast f16 to f32 before printing (a subtle mistake many would make), which led to wrong reference values. And because the reference table was wrong, it led to all sorts of wrong decisions downstream, such as adding casting and deltas.
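To see why the cast matters, here is a minimal Python sketch. It is not Ryu and not the code from the PR, just a brute-force search for the fewest significant digits that round-trip through a given width via `struct`. The helper names are made up for illustration, and a real shortest-digits algorithm may differ from this search at some boundary values.

```python
import struct


def roundtrip(x: float, fmt: str) -> float:
    """Round x to the width given by a struct format ('e' = f16, 'f' = f32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def shortest_repr(x: float, fmt: str) -> str:
    """Fewest significant digits (scientific notation) that read back as x.

    Brute force for illustration only; x must already be exactly
    representable in the target width.
    """
    for digits in range(1, 18):
        candidate = f"{x:.{digits - 1}e}"
        if roundtrip(float(candidate), fmt) == x:
            return candidate
    return repr(x)


# The f16 value closest to 0.1 is exactly 0.0999755859375.
x = roundtrip(0.1, "e")

print(shortest_repr(x, "e"))  # 1e-01: one digit is enough to identify it among f16 values
print(shortest_repr(x, "f"))  # 9.9975586e-02: f32 neighbours are much closer, so more digits are needed
```

The value is the same in both cases, but the shortest string depends on how far away the neighbouring floats are. Printing after a cast to f32 answers the question for the wrong set of neighbours.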
That's when I decided to be in-the-loop and break the problem into smaller ones:
1. I asked Claude to create an f16 reference table and made it clear in the prompt that any sort of casting would lead to the wrong solution. That's the reference table you can find in the PR.
2. Then I asked it to explicitly port the Erlang algorithm, as is, and then parameterize the constants in the algorithm to make it generic (so it works for f16/f32/etc). Then write a test comparing all reference f16 values.
3. Then I moved the algorithm to Nx. Since pretty-printing is now precise, it broke 150 tests, which I used Claude to fix (with specific instructions to change only the precision in numbers and not touch anything else).
Claude still made mistakes, but because I broke the problem into small steps and verified correctness at each step along the way, I avoided bad decisions cascading through the whole implementation. And yes, using Claude was still extremely helpful (honestly, if I had used Claude only on step 3, it would have already been worth the price tag).
Those two experiments were orders of magnitude smaller than the browser one. Both seemed to work, but were flawed upon deeper inspection. For now, I'd still advise staying in the loop and avoiding falling into this trap.
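The exhaustive check in step 2 is cheap because f16 has only 65,536 bit patterns. A rough Python sketch of that kind of test follows. The actual test in the PR is written in Elixir, and `format_f16` and `reference` here are hypothetical stand-ins for the ported algorithm and the generated fixture.

```python
import struct
from typing import Callable, Dict, List, Tuple


def f16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def find_mismatches(
    format_f16: Callable[[int], str],
    reference: Dict[int, str],
) -> List[Tuple[int, str, str]]:
    """Compare a formatter against a reference table for every f16 bit pattern.

    Returns (bits, expected, actual) for each disagreement. Keys are raw bit
    patterns rather than floats, so NaNs and signed zeros stay distinct.
    """
    mismatches = []
    for bits in range(0x10000):
        expected = reference[bits]
        actual = format_f16(bits)
        if actual != expected:
            mismatches.append((bits, expected, actual))
    return mismatches
```

An empty result means the formatter agrees with the table on every representable value. That only proves something if the table itself was generated without casting.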
It's still a bit shaky and bleeding-edge, but the "Ralph Wiggum" plugin in Claude Code is the first version of what's to come with autonomous, agentic loops.
It's a "we learn from failure"-centric approach. You define your goal condition and let the agent loop over and over until it has verifiably reached that promised goal.
It might take 2 minutes or a day. But the loop continues to experiment and look at prior work to ultimately get you there.
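Stripped down, the idea looks something like the Python sketch below. This is not the plugin's actual implementation, and `attempt`, `goal_reached` and `max_iterations` are names assumed for illustration: one callable does a unit of work, the other verifies the goal (for example, by running a test suite).

```python
from typing import Callable


def loop_until_goal(
    attempt: Callable[[int], None],
    goal_reached: Callable[[], bool],
    max_iterations: int = 50,
) -> int:
    """Repeat attempts until the goal check passes; return the iteration count.

    The cap is a safety net so a goal that can never be verified does not
    keep the loop running forever.
    """
    for iteration in range(1, max_iterations + 1):
        attempt(iteration)   # e.g. hand the agent the same prompt again
        if goal_reached():   # e.g. the test suite exits with status 0
            return iteration
    raise RuntimeError(f"goal not reached after {max_iterations} iterations")
```

The important part is that `goal_reached` is checked by something other than the agent's own say-so.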
I've been seeing solid results with that. Takes some massaging and setting things up right (mostly for there not to be any interruptions), but when it works, it WORKS.
You can install this inside Claude Code by going to /plugin and typing `ralph`.
This is the most fun moment to be a developer in years.
The AI tools are imperfect, the patterns are still emerging, and there's genuine room for experimentation. Roll up your sleeves and build something. The earthquake is further opening up what's possible.
The best news about this new layer: traditional engineering skills are more valuable than ever, not less. It helps us minimize shipping slop.
Developers who already invested in CI/CD, testing, documentation, and code review are having the most success with AI tools. These "boring" foundations are accelerators. They turn agents from chaos generators into productivity multipliers.
The real opportunity is learning to work at a different altitude. Instead of typing syntax, we're reviewing implementations, catching edge cases, and shipping features in hours that used to take days. That's genuinely exciting.
Yes, there's a learning curve. Understanding how to provide context, iterate on plans, and review AI-generated code quickly takes practice. But this is learnable through doing - build small tools, review everything, develop intuition through repetition.
The multiplier potential is real when you combine AI speed with engineering judgment. We're not replacing coding skills but we're finally able to focus them on the interesting problems while delegating the tedious parts.
I've been saying all year that giving the agent a shell + the file system removes mountains of complex abstractions. Glad to see a return to basic stuff that works proving itself out.
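For a sense of how small that surface can be, here is a hedged Python sketch of a single "run a shell command" tool. It is an illustration only, not any particular agent's tool API, and `run_shell` and its parameters are assumed names. A real setup would need sandboxing around it.

```python
import subprocess


def run_shell(command: str, cwd: str = ".", timeout_s: float = 30.0) -> str:
    """Run a command in a shell and return its exit code and output as text.

    With this one tool the agent can use the file system and existing CLIs
    (ls, grep, git, test runners) instead of needing a bespoke tool for each.
    """
    result = subprocess.run(
        command,
        shell=True,           # let the shell handle pipes and redirects
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout_s,    # raises subprocess.TimeoutExpired on hang
    )
    return f"exit code: {result.returncode}\n{result.stdout}{result.stderr}"
```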
https://t.co/C6MOglPujq
Congrats to Rust, Gleam (welcome!!) and Elixir on being the top 3 most admired languages on the StackOverflow Survey and Phoenix for being the most admired web framework for the third year in a row! https://t.co/zycBJ8ncTu
Why aren’t you pairing more with him? He types twice as fast as you.
Of course he does. So does a cat having a seizure on a mechanical keyboard. But that doesn’t mean it should be writing production code.
😂
https://t.co/6pqWLqbQDE
The golden rule still mostly holds:
The ones talking about numbers are seeking validation.
The ones just building are the ones doing the real numbers.
Do we see the founders of Lovable, Bolt, Cursor, etc show off numbers every day?
Most BBC traffic is going through an Elixir-powered routing layer and it is all running on 12 nodes: “fewer incidents, better spike handling, more confidence”.
Appreciating Elixir #myelixirstatus:
~1.5M ARR business with > 2000 users
3 servers running @flydotio shared-cpu-2x:4096MB
< 200MB RAM average
42ms response time average
AND, we're super overprovisioned on servers. I had a memory leak and just fixed it. Will drop to 2GB servers
Thank you @arvidkahl
This and The Embedded Entrepreneur were seminal in my journey building @AttendistHQ
So much more to share very soon.
Thank you for being open about your process.
Nice to see a picture of me from the lockdown years
I still think back to this day a lot, when so many amazing people bought my first book and gave it a place in their home. Almost five years ago now. 🥰
Still selling a couple every day. Still grateful for every single one of them.
Introducing Tidewave: beyond code intelligence.
While working on our web apps, we run code, query the database, read logs, search docs… but AI tools are limited to compiling code.
Watch Tidewave transform Claude Desktop into an agent by running an MCP server in your web app!
It's a little hard to believe. Fourteen years ago today, I launched Buffer from my apartment in Birmingham, in the UK.
Today the business generates $1.65 million per month, serves 59,000 customers, and enables fulfilling work for 72 people.
I think "my idea takes a long time to build" is usually a massive red flag (unless you're building a nuclear power plant or similar).
It usually just means you can't minimize the requirements enough (Elon-style) and are massively procrastinating and overcomplicating it.
The first version of your site/app/startup doesn't have to be great or polished but it should solve the primary problem a user has quickly
Everyone these days can put up a waiting list with email box, I can do that in literally 1 minute, it's useless
Don't be lazy, just build the thing
9am - noon: everyone on the team talks to users
noon - 10pm: we sit in a room and build what they asked for
rinse and repeat, 6 days a week (we’ve been at it for ~1.5 years)
things fall into place if you do it long enough
the view gradually gets better too 🏔️
@DanielLockyer @levelsio I suggest doing what @dannypostma did with his SEO course.
- Create right now a waitlist page.
- Start creating a presale page, no need to work on the course yet.
- Put a limited special price ($59/69).
- Put tiered prices regarding sales volume.
Enjoy.