hotter take: ML is a product design problem
your ability to improve any ML-based product depends on the volume and quality of your data. if your product isn't purpose-built to capture as much of this data in as high quality a way as possible, you're gonna have a bad time
maybe hot take 👀 Improving Agents is a Data Mining Problem
Harness Engineering, Post-Training, Continual Learning...these all boil down to the same underlying substrate - Mining Agent Traces
1. I need to run my agents to collect Traces
2. Understand behaviors from Traces at scale
3. Filter data for "improvement"
4. Do an "improvement experiment"
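A minimal sketch of the four steps above, under loud assumptions: `Trace`, the toy agents, the step names, and `TASKS` are all hypothetical stand-ins, and "understanding behaviors" is reduced to a per-step failure rate. Real traces and real improvement experiments will be much messier.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trace:
    """Hypothetical trace record: one agent run on one task."""
    task: str
    steps: tuple[str, ...]  # ordered step names (tool calls, actions, ...)
    success: bool


Agent = Callable[[str], Trace]


def collect_traces(agent: Agent, tasks: list[str]) -> list[Trace]:
    """Step 1: run the agent on every task and keep the traces."""
    return [agent(task) for task in tasks]


def failure_rate_by_step(traces: list[Trace]) -> dict[str, float]:
    """Step 2: for each step, the share of traces containing it that failed."""
    seen, failed = Counter(), Counter()
    for trace in traces:
        for step in set(trace.steps):
            seen[step] += 1
            if not trace.success:
                failed[step] += 1
    return {step: failed[step] / seen[step] for step in seen}


def filter_for_improvement(traces: list[Trace], step: str) -> list[Trace]:
    """Step 3: keep the failed traces that exhibit the suspicious step."""
    return [t for t in traces if not t.success and step in t.steps]


def success_rate(traces: list[Trace]) -> float:
    return sum(t.success for t in traces) / len(traces) if traces else 0.0


def improvement_experiment(baseline: Agent, candidate: Agent, tasks: list[str]) -> float:
    """Step 4: run both agents on the same tasks; return the success-rate delta."""
    return success_rate(collect_traces(candidate, tasks)) - success_rate(
        collect_traces(baseline, tasks)
    )


# Toy agents (assumptions for illustration only).
def baseline_agent(task: str) -> Trace:
    # Skips tests on refactor tasks and fails them.
    if "refactor" in task:
        return Trace(task, ("edit_file", "skip_tests"), success=False)
    return Trace(task, ("edit_file", "run_tests"), success=True)


def candidate_agent(task: str) -> Trace:
    # "Improved" agent definition: always runs tests.
    return Trace(task, ("edit_file", "run_tests"), success=True)


TASKS = ["fix bug", "refactor parser", "add flag", "refactor cli"]

if __name__ == "__main__":
    traces = collect_traces(baseline_agent, TASKS)
    rates = failure_rate_by_step(traces)
    worst_step = max(rates, key=rates.get)
    print(worst_step, len(filter_for_improvement(traces, worst_step)))
    print(improvement_experiment(baseline_agent, candidate_agent, TASKS))
```

The mechanism in step 4 is deliberately left open here: the candidate could come from a harness change, SFT, or anything else. The trace-mining steps before it stay the same either way.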
There's a reason why every continual learning platform ends up looking like an observability platform. It's because Traces are the lifeblood of agent improvement and data is king 👑
The mechanism that we use to attempt improvement can vary - Harness Eng, SFT, etc. But without understanding the data agents produce, no algorithm will truly build better agents.
The holy grail of Agent Improvement is Continual Learning. Consistently mining data and integrating it into the agent definition over infinitely long time horizons. Today, the easiest way to do that is to build an observability platform and constantly point agentic compute at understanding the data that agents produce
The problem with the "if it works who cares what the code looks like" mindset for agentic work is that it assumes the agent has a perfect understanding of "works." Realistically, things are underspecified, agents make bad assumptions, etc.
To be fair, agents are pretty good at unit test coverage. They're pretty bad at designing human experiences (API, CLI flags, etc.), especially cohesive ones for future roadmap plans they may not have visibility into (unless your backlog is perfect and vision fully laid out, which I doubt). They're bad at knowing where performance matters and what type (CPU vs memory tradeoffs). They're bad at knowing where compatibility matters and where it doesn't (and tend to err on the side of preserving it without further guidance). Etc.
Unless you have this ALL specified, you can't possibly claim "it works" without taking a look and thinking about it.
- i'm angry about this because i want access to fable, for myself and for others, and simultaneously believe anthropic's safeguards were sufficient and the US government badly misunderstood the information they were presented
- but in the abstract this is in fact exactly what I want. it's heartening to see the USG treat artificial intelligence with the seriousness and immediacy it deserves. this kind of swift action is what might have a chance of saving us from unaligned RSI.
- but i also very much don't trust *this* government to handle this well, to take sane unilateral action, to chart any kind of correct path.
- and this escalates the global race enormously. this is as strong a signal as you can send, not just to China but to the EU and even our closest allies, that the US will not be sharing this advantage. that if they want sovereignty they're going to have to fight for it
- obviously, that was always the case, and it was always going to happen eventually. but i don't think now was the time to send that signal. it would have been better to delay as long as possible.
very mixed feelings today
@dbreunig if only there was some sort of analogous human system for making judgements about fault and accountability as related to a set of natural language rules…
one of the biggest blockers to truly automated software engineering is agents' inability to rigorously verify their work. this is exactly the problem @niteshiftdev solves, and I can't imagine a better team than Sajid and Conor to be working on it
We're launching @niteshiftdev – the full-stack cloud for coding agents
Verification is the new bottleneck.
Software teams can now define their dev environment and verification tools once. Then run any frontier agent in the cloud: Claude Code, Codex, or OpenCode
My last observation re: Anthropic’s secret sabotage safety policy is that it undermines actually good safety policy. How?
1. First, it is very plausible to describe this as anti-competitive behavior (even if you are maximally sympathetic to Anthropic here you must admit this), and it is behavior being justified in the name of AI safety. If you believe, as I and many Anthropic staff do, that it may end up being critically important to relax antitrust enforcement so that the frontier labs can cooperate and collaborate on some areas of AI safety, Anthropic just undermined the case for that in a large way.
2. Overall, this massively and profoundly raises the status of the argument that AI safety has been hype to justify monopolistic behavior by labs. I continue to believe AI safety is a real and serious issue that is growing in importance rather than diminishing. If you agree with me, this incident is a setback, maybe a serious one.
3. As I have observed elsewhere, Anthropic’s official corporate policy is structurally identical to the fact pattern alleged against them by the Department of War. I still think DoW acted both falsely and wrongly in that fight, but it is no longer possible to defend Anthropic with a full throat after this incident.
4. This raises the case for heavier handed regulations. Anthropic is making an awfully good case here that their products ought to be treated as utilities, and thus that their alignment practices should be a matter of public policy rather than private property. I am starkly opposed to this sort of state power grab, but Anthropic is doing more to justify it than anyone else.
5. Thus, significant damage has been done to a community and entire approach to AI governance. It was done unilaterally by Anthropic, likely motivated largely by self-interest and justified within the internal psychology of the firm through the lens of safety.
I suspect this is fixable in the economic and legal senses for Anthropic, but I fear the trust that has just been broken, and the goodwill extinguished, will take a long time to repair.
every "coding is solved" argument I've heard relies on this same incredibly pedantic interpretation of the word "coding"
yes, obviously there is more to software eng than coding. but defining coding so narrowly that it excludes quality, maintainability, and performance (all of which require domain understanding) makes the argument at best specious, and at worst deliberately misleading
Coding is just one part of engineering. There’s also debugging, operating services, scaling up infrastructure, deciding what to optimize, setting up hardware and capacity, talking to users, product planning, etc. Coding is the easy part; everything else is not yet solved (but is also becoming increasingly automated).
I think this is obviously the case, no? the hard problem has always been how to turn complex, subjective human preferences into measurable, hill-climbable objectives. RL is only a valuable tool if you've already solved that problem
@scottastevenson isn't this more of a semantic issue around the definition of "eval"? like I view this less as "evals don't work" and more "people suck at doing evals". if you're building a user-facing AI product, obviously you should be incorporating user interaction data into your evals
> ask agent to review effect-ts docs for best-practices we should follow
> "we often call Effect.ignore without logging anything"
> "so true, please fix that"
> it adds TWELVE module-local helpers wrapping Effect.ignore
> open effect docs
> Effect.ignore takes a `log` parameter
SpaceX's revenue could reach $3.4 trillion by 2040, according to analysis Morgan Stanley shared with investors yesterday. Goldman Sachs made a similar projection yesterday: SpaceX revenue could hit $322 billion by 2030.
@dbreunig if building "in distribution" software becomes 10x easier, presumably that results in some sort of mode collapse, right? like you'd expect the "distribution of generated software" to become very spiky around whatever things models are good at today?