I’m finding that I have to start every project with an eval. Even the concept of a traditional product requirements document feels like it’s getting outdated. The conceit behind product requirements was that we actually understood what the requirements were. Often we don’t understand the optimal set of requirements to achieve a goal. An agent can iterate on both the requirements and the implementation very quickly. So what you really need is a measurable goal. Of course that’s not always easy either.
I really don't think the classes of errors Opus is making now are that hard to avoid. All the debugging I do as a human is very simple common sense. 0. Have an eval and run it on every change. 1. When eval items fail, sample the inputs and apply basic math to check if the outputs are plausible. 2. When errors are found, ask if they apply broadly or just to the sample. 3. Fix appropriately and recalculate. 4. Resample and continue. ... and don't say you're done until resampling is clean.
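A rough sketch of that loop, on a toy task I made up for illustration (doubling an integer). Every name here (`run_eval`, `sample_failures`, `is_plausible`, `share_matching`, `buggy_double`, `fixed_double`, `ITEMS`) is hypothetical, and the plausibility rule is an assumption that only makes sense for this toy task. It is not a real harness, just the shape of steps 0 through 4.

```python
import random
from typing import Callable

# Hypothetical eval set for a toy task: (input, expected output) for "double x".
ITEMS = [(x, 2 * x) for x in (-3, -1, 0, 2, 5)]


def buggy_double(x: int) -> int:
    # Deliberate bug for the demo: wrong sign on negative inputs.
    return abs(x) * 2


def fixed_double(x: int) -> int:
    return x * 2


def run_eval(model: Callable[[int], int], items: list) -> list:
    """Step 0: run the full eval on every change; return the failing items."""
    return [(x, want) for x, want in items if model(x) != want]


def sample_failures(failures: list, k: int = 3, seed: int = 0) -> list:
    """Step 1: inspect a small random sample of the failures."""
    return random.Random(seed).sample(failures, min(k, len(failures)))


def is_plausible(x: int, got: int) -> bool:
    """Step 1, continued: basic math check. Assumption: doubling keeps the sign."""
    return (got >= 0) == (x >= 0)


def share_matching(failures: list, hypothesis: Callable[[int], bool]) -> float:
    """Step 2: does the suspected cause explain all failures or just the sample?"""
    if not failures:
        return 0.0
    return sum(1 for x, _ in failures if hypothesis(x)) / len(failures)


failures = run_eval(buggy_double, ITEMS)
for x, want in sample_failures(failures):
    print(x, want, buggy_double(x), is_plausible(x, buggy_double(x)))
print(share_matching(failures, lambda x: x < 0))

# Steps 3 and 4: fix, rerun the whole eval, and only call it done when it is clean.
done = run_eval(fixed_double, ITEMS) == []
```

The point of `share_matching` is step 2: if the suspected cause explains every failure, fix it broadly; if it only explains the sampled ones, keep sampling.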
Now that I've given Opus all this independence, my app development is really moving faster, but the quality of progress is poor. He ignores broken tests, commits code that is failing tests, doesn't sanity check output, and tells me my evals are wrong instead of fixing the problem.
I'm so proud, he's just like a Google engineer. Time to turn "sanity check the implementation motherfucker" into a skill.md file.
Moved from Opus 4.7 in "ask permission" mode to 4.8 in auto mode. Feels like my 11-year-old grew up to be 16, got a driver's license, and now feels like he can backtalk to me in language I barely understand while busting up the car every other weekend.
What comes after capitalism? New technologies tend to create newly valuable things and means of exchange. Agriculture => Land => Labor; Mechanization => Company ownership => Money
AI => Tokens => ?
Today we pay money for tokens, so you could say it's just money again. But what if tokens become more valuable than money? Especially if AI is doing most of the labor, so people can't exchange labor for money?
While initial attempts to regulate AI have failed in the US, I wonder how long plebs like us will have access to frontier models. We already lost access to the frontier Anthropic model Mythos due to its exceptional hacking capabilities. Intelligence is the ultimate weapon and money-making tool. History suggests (think nuclear weapons, gun control, surveillance technologies, HF trading, etc.) that governments, corporations, militaries and well-funded criminal groups tend to either buy or bully their way into exclusive use of such tools.
Even if you think "the cat is out of the bag" with open source, they can seize control of the means of production - datacenters and GPUs.
If I ruined your weekend, my apologies. https://t.co/9Kry5iWpby
💯 which is why I am currently avoiding playing with the constant release of new types of agent harnesses. Any of the top agents or harnesses work pretty well. The opportunity is all in improving my agent skills, picking the right problems, and running as many agents in parallel as I can. It's sort of like moving from riding a bicycle to flying a fighter jet. You can go further and you can also fuck things up a lot faster.
Another velocity improvement I experience with agentic coding is the benefit of shared mental models. When working with engineers (as a PM), often one of two problems emerges.
Some teams will just do what I ask without any discussion of feature-time-usability-implementation tradeoffs. Other teams will ignore what I ask, thereby ignoring the customer requirements. Both of these lead to poor products.
When working with Claude (and it's using my PRD and Design skills) I have a counterpart that is a good PM, Designer and Engineer. Both quality and speed increase since both parties can play all those roles.
The wins I’m seeing with software development are that the coordination overhead of big teams and cross-functional collaboration drops 90%, just due to the fact that now a single individual can do the work of 10 SWEs and a designer and PM. And that collaboration was where most of the time was spent.
The teams seeing the biggest wins from AI are completely changing how they work, not speeding up what they already do. What steps can you delete, what handoffs go away, what can an agent just own end to end. Great to see Salesforce go this deep. Shoutout to Srini, @Benioff & team.
Full writeup: https://t.co/3Rbdj9K8YN
@TradingTutorLLC I can see going long here, beautiful chart. What are your thoughts on why LEAPS out of the money vs. a strike closer to 500? It reduces capital for the position, but does it improve RR?
How is the Codex team thinking about the future of version control? Is the pull request workflow still relevant? Given that agents create way more PRs than humans can ever review, it seems like the best practice is gonna move to rigorous integration tests and in-production observability with canaries. If that’s true, then we don’t really need something like the GitHub user interface anymore. We just need some CI/CD agents that triage issues when things get built and deployed, and they only need MCP or a CLI to a shared git repo.
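To make the canary idea concrete, here is a minimal sketch of the kind of gate an agent could run after a deploy instead of waiting on a human PR review. The names (`Metrics`, `canary_decision`) and the thresholds (`max_ratio`, `min_requests`, `floor`) are my own assumptions for illustration, not anything from Codex, GitHub, or a real deployment tool.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical request/error counts scraped from observability."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Metrics,
    canary: Metrics,
    max_ratio: float = 1.5,   # assumed: canary may be at most 1.5x the baseline error rate
    min_requests: int = 100,  # assumed: don't judge the canary on too little traffic
    floor: float = 0.001,     # assumed: tolerate tiny error rates even if baseline is 0
) -> str:
    """Return "wait", "promote", or "rollback" for a canary deployment."""
    if canary.requests < min_requests:
        return "wait"
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


baseline = Metrics(requests=10_000, errors=50)
print(canary_decision(baseline, Metrics(requests=500, errors=2)))
print(canary_decision(baseline, Metrics(requests=500, errors=20)))
```

A real gate would look at more than error rate (latency, saturation, business metrics), but the decision shape stays the same: compare the canary to the baseline and promote or roll back automatically.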
many of these companies have so much bureaucracy and make-work that layoffs are the only way to break the logjam. so the problem isn't that they couldn't think of ways to use the brain power, it's that many layers of management and value-removing process are in the way of innovation
very true. I went to two ENTs recently for the same ear condition. With the first one I led off with one set of symptoms and they immediately recommended an expensive, risky surgery. Didn't even bother to do any tests. Maybe it's not a coincidence that their practice is owned by private equity. Next I went to a well-known research university hospital, where I started by describing one of the other symptoms I had. This time they actually did some tests, but still didn't figure out what the root cause was. Anyway, they ended up recommending a completely different risky procedure. My experience is that it's much more likely you'll get a good answer out of AI. But of course the prompt and the tests done do matter.