Really fortunate to spend time at the White House yesterday with @sinasojoodi. So damn impressed by the people in our Gov't quietly doing the hard, complex, "I was told this would be impossible" work to improve the agencies they're a part of... 🇺🇸📈🚀
@sinasojoodi and I were fortunate enough to spend some time at the WH yesterday. Really inspiring to see so many people in our Gov't quietly doing the hard, complex, "I was told this would be impossible" work to transform the way their agencies run for the country. 🇺🇸📈🚀
Judging by my tl there is a growing gap in understanding of AI capability.
The first issue I think is around recency and tier of use. I think a lot of people tried the free tier of ChatGPT sometime last year and allowed it to inform their views on AI a little too much. This is the group whose reactions laugh at various quirks of the models, hallucinations, etc. Yes I also saw the viral videos of OpenAI's Advanced Voice mode fumbling simple queries like "should I drive or walk to the carwash". The thing is that these free and old/deprecated models don't reflect the capability of the latest round of state of the art agentic models of this year, especially OpenAI Codex and Claude Code.
But that brings me to the second issue. Even if people paid $200/month to use the state of the art models, a lot of the capabilities are relatively "peaky" in highly technical areas. Typical queries around search, writing, advice, etc. are *not* the domain that has made the most noticeable and dramatic strides in capability. Partly, this is due to the technical details of reinforcement learning and its use of verifiable rewards. But partly, it's also because these use cases are not sufficiently prioritized by the companies in their hillclimbing because they don't lead to as much $$$ value. The goldmines are elsewhere, and the focus follows.
So that brings me to the second group of people, who *both* 1) pay for and use the state of the art frontier agentic models (OpenAI Codex / Claude Code) and 2) do so professionally in technical domains like programming, math and research. This group of people is subject to the highest amount of "AI Psychosis" because the recent improvements in these domains as of this year have been nothing short of staggering. When you hand a computer terminal to one of these models, you can now watch them melt programming problems that you'd normally expect to take days/weeks of work. It's this second group of people that assigns a much greater gravity to the capabilities, their slope, and various cyber-related repercussions.
TLDR the people in these two groups are speaking past each other. It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram reels and *at the same time*, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems. This part really works and has made dramatic strides because of 2 properties: 1) these domains offer explicit reward functions that are verifiable, meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge), but also 2) they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them. So here we are.
Today I’m very excited to announce a global partnership between 8090 and EY.
EY will adopt 8090’s AI-native Software Factory, which reimagines the software development lifecycle, and use it to help their customers break free from slow, costly and failure-prone legacy enterprise software.
EY is a massive global organization with more than 400,000 employees and tens of thousands of customers in every sector of the global economy.
8090’s Software Factory is the new way for organizations to focus on building software that is powerfully bespoke, high quality, easy to maintain, easy to migrate and always consistent and up to date. No drift, no cruft, no waste.
Companies that build with Software Factory grow faster, are more profitable and are more adaptable in moments of change like the one we are witnessing today.
Let’s rewrite all the enterprise software in the world. EY and 8090 will work together to do their part.
🚨BREAKING: Alibaba tested AI coding agents on 100 real codebases, spanning 233 days each.
the agents failed spectacularly.
turns out passing tests once is easy. maintaining code for 8 months without breaking everything is where AI collapses.
SWE-CI is the first benchmark that measures long-term code maintenance instead of one-shot bug fixes.
each task tracks 71 consecutive commits of real evolution.
75% of AI models break previously working code during maintenance.
only Claude Opus 4 stays above a 50% zero-regression rate. every other model accumulates technical debt that compounds over iterations.
here's the brutal part:
- HumanEval and SWE-bench measure "does it work right now"
- SWE-CI measures "does it still work after 6 months of changes"
agents optimized for snapshot testing write brittle code that passes tests today but becomes unmaintainable tomorrow.
Alibaba built EvoScore to weight later iterations more heavily than early ones. agents that sacrifice code quality for quick wins get punished when consequences compound.
the AI coding narrative just got more honest: most models can write code. almost none can maintain it.
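To make the "weight later iterations more heavily" idea concrete, here is a minimal sketch. It is not the actual EvoScore formula, which this post does not give. The function name `later_weighted_score`, the linear weights (1, 2, 3, ...), and the per-iteration pass rates are all assumptions invented for illustration.

```python
from typing import Sequence


def later_weighted_score(pass_rates: Sequence[float]) -> float:
    """Weighted mean of per-iteration test pass rates.

    Hypothetical scheme (NOT the real EvoScore): iteration i (1-indexed)
    gets weight i, so later iterations count more than early ones.
    """
    if not pass_rates:
        raise ValueError("need at least one iteration")
    weights = range(1, len(pass_rates) + 1)
    weighted_sum = sum(w * r for w, r in zip(weights, pass_rates))
    return weighted_sum / sum(weights)


# Made-up pass rates over four maintenance iterations.
quick_win = [1.0, 0.8, 0.6, 0.4]  # strong start, then regressions pile up
steady = [0.4, 0.6, 0.8, 1.0]     # slow start, code stays maintainable

# Both agents have the same plain average...
plain_quick = sum(quick_win) / len(quick_win)
plain_steady = sum(steady) / len(steady)

# ...but the later-weighted score separates them.
score_quick = later_weighted_score(quick_win)
score_steady = later_weighted_score(steady)
print(plain_quick, plain_steady, score_quick, score_steady)
```

Under these assumed weights the two agents tie on the plain average but the agent that degrades over time scores lower, which is the behavior the post describes.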
Before Software Factory, my team and I were building an AI SDLC manager for two years to do what 8090 offers. After using Software Factory I shut down our internal effort and am reorienting my team around Software Factory. This has already freed up two engineers.
The Software Factory team gets it and delivers on the core principle: holding software representation in requirements, not in code.
- @jbarseneau on Software Factory
Hell of a quote from @Citrini7 here:
"We had overestimated the value of 'human relationships.' Turns out that a lot of what people called relationships was simply friction with a friendly face."
We have more inbound demand for Software Factory than we can manage. We're looking for someone to own that, and to drive targeted outbound. You'll onboard customers, land initial deals, and expand them into seven-figure relationships.
You:
* Are early in your career and hungry
* Can recognize patterns and are obsessed with process
* Have grit and are a self-starter. Success in enterprise sales is often about persistence and genuine curiosity about your customers and their challenges. Early-stage startups are hard
* Thrive in ambiguity and with autonomy
* Want to go all-in with an incredibly talented team
* Have evidence of exceptional ability
This isn't a traditional AE seat. We want missionaries, not mercenaries. Come build with us...