BREAKING: Alibaba tested 18 AI coding agents on 100 real codebases, spanning 233 days each. they failed spectacularly.
turns out passing tests once is easy. maintaining code for 8 months without breaking everything is where AI completely collapses.
SWE-CI is the first benchmark that measures long-term code maintenance instead of one-shot bug fixes. each task tracks 71 consecutive commits of real evolution.
75% of models break previously working code during maintenance. only Claude Opus 4.5 and 4.6 stay above 50% zero-regression rate. every other model accumulates technical debt that compounds with every single iteration.
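for anyone wondering what a "zero-regression rate" could look like in practice, here is a minimal sketch. it assumes a task counts as regression-free when no test that passed at an earlier iteration fails at a later one; the benchmark's exact definition may differ. the function names and the data layout (one set of passing test names per iteration) are hypothetical, made up for illustration.

```python
from typing import Dict, List, Set


def has_regression(passing_per_iteration: List[Set[str]]) -> bool:
    """Return True if a test that passed earlier is missing from a later passing set.

    Assumption: each element is the set of test names passing after one iteration.
    """
    ever_passed: Set[str] = set()
    for passing in passing_per_iteration:
        # Any previously passing test that no longer passes counts as a regression.
        if ever_passed - passing:
            return True
        ever_passed |= passing
    return False


def zero_regression_rate(tasks: Dict[str, List[Set[str]]]) -> float:
    """Fraction of tasks whose whole history contains no regression."""
    if not tasks:
        raise ValueError("need at least one task")
    clean = sum(1 for history in tasks.values() if not has_regression(history))
    return clean / len(tasks)


# Hypothetical example: task "a" only gains passing tests, task "b" loses "t1".
example_tasks = {
    "a": [{"t1"}, {"t1", "t2"}],
    "b": [{"t1", "t2"}, {"t2"}],
}
example_rate = zero_regression_rate(example_tasks)
```

in this toy example one of the two tasks regresses, so the rate comes out to one half.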
here's the brutal part:
- HumanEval and SWE-bench measure "does it work right now"
- SWE-CI measures "does it still work after 8 months of changes"
agents optimized for snapshot testing write brittle code that passes tests today but becomes completely unmaintainable tomorrow.
they built EvoScore to weight later iterations heavier than early ones. agents that sacrifice code quality for quick wins get punished when the consequences compound.
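a rough sketch of what "weighting later iterations heavier" can mean. this is NOT the actual EvoScore formula, which isn't given here; it just assumes geometrically growing weights over per-iteration scores in [0, 1]. the function name and the `growth` parameter are hypothetical.

```python
from typing import Sequence


def late_weighted_score(scores: Sequence[float], growth: float = 1.1) -> float:
    """Weighted average of per-iteration scores where iteration i gets weight growth**i.

    Assumption: growth > 1 makes later iterations count more; growth == 1 is a plain mean.
    """
    if not scores:
        raise ValueError("need at least one iteration score")
    if growth <= 0:
        raise ValueError("growth must be positive")
    weights = [growth ** i for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Two hypothetical agents with the same plain average but opposite trajectories.
quick_win = late_weighted_score([1.0, 0.0], growth=2.0)    # strong early, broken later
slow_build = late_weighted_score([0.0, 1.0], growth=2.0)   # weak early, solid later
```

with these numbers the quick-win agent scores 1/3 and the slow-build agent scores 2/3, even though a plain average would rate them the same. that is the sense in which early shortcuts get punished once they compound.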
the AI coding narrative just got more honest.
most models can write code. almost none can maintain it.
I talk to Fortune 500 CEOs and CIOs all the time who are starting to think about all the new things that they’re going to build software for, and automate. Agents are the first thing that makes this viable for them. And expert engineers are needed to manage those agents.
"we’re announcing $110B in new investment at a $730B pre-money valuation. This includes $30B from SoftBank, $30B from NVIDIA, and $50B from Amazon. We’ve also signed a strategic partnership with Amazon and secured next generation inference compute with NVIDIA."
"Weekly Codex users have more than tripled since the start of the year to 1.6M."
"More than 9 million paying business users rely on ChatGPT for work"
"ChatGPT is where people start with AI, with more than 900M weekly active users, and we now have more than 50 million consumer subscribers"
"We are also expanding our long standing collaboration with NVIDIA, including the use of 3 GW of dedicated inference capacity and 2 GW of training on Vera Rubin systems. This builds on Hopper and Blackwell systems already in operation across Microsoft, OCI, and CoreWeave." https://t.co/nPv9v3aQIK
Big Tech just had its biggest monthly increase in open roles in four years. Almost 18k new jobs over the past month. Last time we saw such a number was in Q1 2022.
Still 45% below the peak, but 60% up from the bottom.
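Those two percentages also pin down how deep the trough was. A quick back-of-the-envelope sketch, assuming both figures are measured on the same open-roles series and treating the rounded 45% and 60% as exact (the function name is hypothetical):

```python
def bottom_as_fraction_of_peak(below_peak: float, up_from_bottom: float) -> float:
    """Given the current level relative to peak and to bottom, return bottom / peak.

    current = peak * (1 - below_peak) and current = bottom * (1 + up_from_bottom),
    so bottom / peak = (1 - below_peak) / (1 + up_from_bottom).
    """
    current_vs_peak = 1.0 - below_peak
    return current_vs_peak / (1.0 + up_from_bottom)


trough = bottom_as_fraction_of_peak(below_peak=0.45, up_from_bottom=0.60)
trough_drop = 1.0 - trough  # how far below the peak the bottom was
```

Under those assumptions the bottom sat at roughly 34% of the peak, i.e. open roles had fallen by about two thirds before the rebound.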
Stan Druckenmiller Interview:
> The US economy is strong
> Don’t think the Fed will hike rates
> Valuations are toward the top range
> A lot of disruptions going to happen
> Portfolio not concentrated in AI anymore
> Bearish on U.S. Dollar
> Short Bonds
> Long Gold
> Long Copper
> Long Korea + Japan
> The macro matters
> More Stimulus
A lot of people have given up on application layer software. FWIW, our partner @Konstantine and I still love the stuff! Not indiscriminately - there’s a massive gulf between the winners and the losers. But overall, we expect software market cap to grow tremendously over the next decade.
@michaeljburry @longriverCM @claudeai @OpenAI For what it is worth, Anthropic and OpenAI models are deployed through PLTR's OS in the Fed.
The reason PLTR has not been mentioned is they're likely the ones complaining to the DoD about Anthropic limiting use
This is counterintuitive for some, which is why there’s a paradox named after it. But if you lower the cost of something that was previously supply constrained, demand for that thing goes up. Software engineering is just one of the easiest examples to contemplate.
The process goes like this: every small business, every IT team, every large enterprise sees that engineering can now drive vastly more output. They then start to consider all the new things they can build or automate. They even test building prototypes themselves.
They only get so far with that approach because they realize there are still 50 other tasks that go into building software and maintaining it. So they start to hire more engineers to do that work. All of this for work they never would have considered automating or having software for if AI didn’t exist.
So yes, automating tasks, in plenty of fields, will lead to more demand for experts, not less.
It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the "progress as usual" way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December and basically work since - the models have significantly higher quality, long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is extremely disruptive to the default programming workflow.
Just to give an example, over the weekend I was building a local video analysis dashboard for the cameras of my home so I wrote: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”. The agent went off for ~30 minutes, ran into multiple issues, researched solutions online, resolved them one by one, wrote the code, tested it, debugged it, set up the services, and came back with the report and it was just done. I didn’t touch anything. All of this could easily have been a weekend project just 3 months ago but today it’s something you kick off and forget about for 30 minutes.
As a result, programming is becoming unrecognizable. You’re not typing computer code into an editor like the way things were since computers were invented, that era is over. You're spinning up AI agents, giving them tasks *in English* and managing and reviewing their work in parallel. The biggest prize is in figuring out how you can keep ascending the layers of abstraction to set up long-running orchestrator Claws with all of the right tools, memory and instructions that productively manage multiple parallel Code instances for you. The leverage achievable via top tier "agentic engineering" feels very high right now.
It’s not perfect, it needs high-level direction, judgement, taste, oversight, iteration and hints and ideas. It works a lot better in some scenarios than others (e.g. especially for tasks that are well-specified and where you can verify/test functionality). The key is to build intuition to decompose the task just right to hand off the parts that work and help out around the edges. But imo, this is nowhere near "business as usual" time in software.
You can go on LoopNet and buy a laundromat for 3-5x cash flow
I guess the interesting intellectual question is - are some of these former category leaders in software really no better at long-term cash flow generation than laundromats?
This pod reminded me of the countless conversations @dsundheim and I have had over the years and made me quite nostalgic. I think it was 2014 when we first discussed SpaceX. We generally see things similarly 99% of the time so when he shared his pitch on SpaceX back then it kind of blew my mind because that was so far from anything I would have had the vision to see at the time. Glad to see him sharing.
Dan mentioned his start in banks. I started covering telcos. It will surprise nobody I know well that I have a theory on this. I think learning how to invest by first covering slower growing, highly regulated, oligopolistic businesses within analyzable sectors with understandable valuations establishes a better foundation in all of the core skills needed to be a good investor. Starting in a high growth, unregulated, new entrant and no valuation discipline sector is sexier but won’t ground you in the universal skills. Learn how to invest in a confined box then break out of the box with that skillset and you will be better at every kind of investing.
Salesforce CEO @benioff mounts a passionate defense of software:
"The AI companies love our products. They're some of our largest customers. Anthropic, OpenAI, Google, Amazon."
"That's reality. No one has a company running entirely on a LLM, because that's not real."