AI R&D | Fractional Advisor | Tech Lead | 20+ yrs Software Engineering | 9+ yrs Datascience (AI/ML). Also tweet about politics when democracy is on the line.
After reflection, this new narrative by Palantir is probably much more consequential than people may assume.
Palantir is basically being the canary in the coal mine announcing the death of two major assumptions propping up the US economy right now:
1) that AI labs will be able to extract significant economic rent - as opposed to AI models being mere commodities
2) that other countries can accept structural dependency on US technology and services without pushing back on sovereignty concerns
Why are Palantir specifically starting to be vocal about this?
First off, major middle-powers, even US “allies”, are one by one showing them the door. In June, France announced that the DGSI - its domestic intelligence agency, which had relied on Palantir since the 2015 Paris attacks - would replace it with French firm ChapsVision, with Prime Minister Lecornu explaining (https://t.co/SLhEGprBZC) that France “cannot accept new strategic dependencies in the digital sphere” and shouldn't depend on the goodwill of companies “capable of turning off the tap.”
Germany moved even earlier: its domestic intelligence service, the BfV, also selected ChapsVision over Palantir (https://t.co/pDZVj4SYUY), and the German military has said it will no longer use Palantir at all. Then, just this week, Spain instructed state-controlled companies - including strategic firms like Telefónica, Indra and Navantia - to avoid signing any new contracts with Palantir (https://t.co/0ik4UAFrT7).
Even in the UK, Washington's most loyal vassal, the NHS's £330 million data contract with Palantir is under review following parliamentary pressure (https://t.co/uJl6g4BMsW), and London Mayor Sadiq Khan blocked a proposed £50 million Palantir contract with the Metropolitan Police.
Palantir making a lot of noise about how much they care about sovereignty makes a lot of sense: it's damage control since they keep being told they're a sovereignty risk.
I doubt it will work - because it's true: they are a sovereignty risk - but the fact that they feel the need to be vocal around this tells you where the wind is blowing: they're not shaping the narrative, they're reacting to one they're losing.
What they're saying against closed-source AI (basically a broadside attack on OpenAI and Anthropic) is again highly self-serving. Palantir's sudden love of open-weight AI models conveniently coincides with their launch, 2 days earlier, of a partnership with Nvidia to sell exactly that: open models (NVIDIA's Nemotron) in sovereign environments.
So it's essentially a product launch.
It doesn't make what they're saying wrong: it is factual that the value proposition of closed-source AI labs looks increasingly unsustainable. I mean: you're paying 10X the price of Chinese open-source AI models for something that's not really better (or just marginally) and on top of that you have zero control over your data, or the models themselves.
When Palantir says that "the architecture that maximally preserves sovereignty is one that enables institutions to own their tribal knowledge, and to compound it as alpha," they're right. I'd add that this also means you shouldn't trust Palantir either with that "tribal knowledge"... they obviously left this part out 😉
When you take a step back, these two things have major implications for many other US companies.
SpaceX - which just went public at the largest IPO valuation in history - is one clear example as I describe in my latest article on the new space race with China (https://t.co/JK3ELAyEVO).
If countries like France concluded with Palantir that they couldn't depend on a company “capable of turning off the tap” when it’s merely analyzing their data, what should they conclude about a company that aims to literally control their entire connectivity - at one man's whim, from space?
What percentage of SpaceX's crazy market cap is based on the assumption that foreign governments will not do to Starlink what they're currently doing to Palantir?
And SpaceX - or Palantir - aren't alone: a significant proportion of the top US tech giants, who rose in a world where no one questioned American technological hegemony, now face an environment that's much less conducive to the kind of lock-in their business models - and valuations - depend on.
When you pair this with the fact that it increasingly looks like the US made a wrong bet with closed-source AI - an extremely expensive wrong bet - the picture that emerges is of a country that bet its economic future on two things - proprietary AI and captive allies - and is losing both at the same time.
And to compound the problem, it doesn't help that the official narrative of the US government - via the voice of Jacob Helberg, the Under-Secretary of State (https://t.co/Z1rotPl9Ee) - is to be vocally opposed to "AI Sovereignty": essentially telling everyone "you know what, your worst fears are real, our tech companies are really out to undermine your sovereignty."
Read Helberg's post (the one I linked) and put yourself in the shoes of - say - a European or Asian leader and ask yourself how you'd react to being told that building your own AI capabilities is "marching in perfect formation into the past," that your pursuit of sovereignty is really just "synchronized mediocrity," and that your only path to the future runs through American technology.
If it was me in a position of power, I'd read this as a massive wakeup call: when another country's official position is that your sovereignty is a problem, history says you're about to need it.
So yes, it looks like - unexpectedly - Palantir, of all companies, is being quite the canary in the big tech mine. Yes they obviously do this for self-serving and cynical purposes, and yes they're of course also very much part of the problem and not the solution. But it doesn't make them wrong: sometimes it takes a vulture to tell you something is dying.
With agentic coding, complexity compounds in a mechanical way: unnecessary code ends up in the codebase, moves to the context window, degrades the model's reasoning abilities, and leads to more unnecessary code (often to fix issues arising from the unnecessary code). It's exponential.
I never dared dream I would get to cross this shot off my bucket list. A beautiful thunderstorm (#onweer) over the iconic row of windmills at #Kinderdijk. This is a "single shot", not a lightning stack. All the discharges struck at the same time.
I am very excited about this research: We show 2 things:
1. If you just do random sampling (i.e. you try to solve a problem k times independently, and keep the best) your Elo scaling will be linear in log(test-time-compute). Agents like Claude-Code and Codex scale like that after a few hours.
2. We compare human expert coders to coding agents on the same tasks (from AtCoder Heuristic Contest). The exciting finding is that humans scale super-linearly. This is evidence that humans do continual learning, while they are solving a problem!
I.e. they learn more about the coding problem they are trying to solve and scale fundamentally better compared to randomly trying things in a memoryless fashion.
This is empirical evidence that supports what many of us have felt for a while: unless we solve continual learning we will not be able to outperform humans in tasks that take many days. Current coding agents are not able to do this.
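A toy sketch of the first point (best-of-k random sampling), under a strong assumption that is not from the research itself: each independent attempt's score is drawn from an exponential-tailed distribution, here Exp(1). In that case the expected best of k attempts is the harmonic number H_k, which grows like ln(k), so each doubling of attempts adds roughly the same increment. The function names and the score model are illustrative only, and how a score would map to Elo is not modeled.

```python
import math
import random


def expected_best_of_k(k: int) -> float:
    """Exact expected maximum of k i.i.d. Exp(1) scores: the harmonic number H_k."""
    return sum(1.0 / i for i in range(1, k + 1))


def simulated_best_of_k(k: int, trials: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate: make k memoryless attempts, keep the best, average over trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each attempt is independent: nothing learned in one attempt helps the next.
        total += max(rng.expovariate(1.0) for _ in range(k))
    return total / trials


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16, 32, 64):
        exact = expected_best_of_k(k)
        approx = simulated_best_of_k(k)
        print(f"k={k:3d}  exact={exact:.3f}  simulated={approx:.3f}  ln(k)={math.log(k):.3f}")
```

In this toy model the gain per doubling of k settles near ln(2), which is what "linear in log(compute)" looks like. A solver that learns during the attempt would break the independence assumption, which is the contrast the post draws with human experts.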
Typical coding day with Claude (Opus 4.8)
- explain to Claude the task (5 minutes)
- Claude implements task (10 minutes)
me: "Why is this necessary?"
Claude: "You're right to push back! I over-engineered this!"
- Repeat x87 times (13 hours)
This benchmark addresses my problem with 5.5: it passes the tests but writes shitty code. We don't need a model's output to work today, we need it not to break tomorrow...
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks for the reward hacking plaguing other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete : HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode
To me the most beautiful chart came when I requested a special historical run on all extant old models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.
This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
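One way to sanity-check the reroll arithmetic, assuming each attempt succeeds independently with the quoted pass rate (an assumption; the post does not say how its figures were derived). The chance that at least one of n attempts succeeds is 1 - (1 - p)^n, so the number of attempts needed for a target success rate follows directly. The function name is illustrative.

```python
import math


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest n with 1 - (1 - pass_rate)**n >= target, assuming independent attempts."""
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Solve (1 - pass_rate)**n <= 1 - target for n and round up.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))


if __name__ == "__main__":
    for p in (0.41, 0.74):
        print(f"pass rate {p:.0%}: {attempts_needed(p)} attempts for >= 95% success")
```

Under these assumptions a 74% pass rate needs 3 attempts (the first try plus 2 rerolls) to clear 95%, while a 41% pass rate needs 6 attempts.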
My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....
The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
My friend went to an indie hacker meetup this week and said this:
"i went to indie hacker meetup
so what’s really interesting is that almost everyone is super focused on development.
they build these whole spaceships that generate code, review it, make all kinds of reports, analytics, and so on.
one guy built an entire factory: he has a list of ideas, and agents generate the landing page, the saas, the analytics, and pull everything into one dashboard. straight-up sci-fi.
and they're focused on optimizing all of it like crazy.
and you can really see how comfortable that is for them.
but the most interesting part is that almost none of them have money or traffic.
and nobody knows where to get either one.
you often hear something like, yeah, i should probably do marketing, but first i’ll finish my super system and then i’ll start.
or at best: i would need to make an agent that will post to instagram automatically
before, the classic programmer would spend a year writing code, tests, preparing for scale in the basement, and not show anything to anyone.
now it’s even worse: the amount of useless aislop nobody needs has grown massively."
Introducing Harness-1, a 20B search agent trained with a state-externalizing harness.
> frontier-level long-horizon search, rivaling Opus-4.6 and outperforming GPT-5.4
> Context-1-level cost and latency
> externalizes candidates, evidence, verification, and search history
> open-source
This is the best and most balanced report I've read by Anthropic, free of many of the super sci-fi, everything-is-exponential language of some other reports I've read by this amazing team.
But one line is dead wrong. This one about recursive self-improvement:
"[If] AI systems themselves become capable of full recursive self-improvement, and begin building their successors...In this world, the pace of progress in AI development becomes determined entirely by the availability of compute (or the speed of discovering various efficiencies in algorithmic training or inference) for AI systems."
Compute is absolutely NOT the only limiting factor in recursive self-improvement and not even the most important one. There are two more:
1) Time
2) Multiplicity
Time is how long it takes to get an answer.
Multiplicity is when there are no right or wrong answers but only shades of gray, with right(ish) answers and wrong(ish) ones.
They even point to one of them (time) just a few paragraphs later:
"More intelligence can’t learn what a drug does over decades of use, can’t hold elections sooner than a constitution dictates, and can’t turn a stranger into an old friend in a weekend. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute. That collision, where recursive intelligence building itself ever faster meets the world of humans, relationships, and governance, is another part of this future we can’t predict."
But let me make it even more clear:
AI got good at code and games because they have great feedback loops and tight timelines. If the code works or does not, you know pretty quickly.
It got good at driving for the same reasons. Don't die or drive off the road or hit someone are achievable (though difficult) goals with clear, fast feedback.
You cannot answer the question "is this a good article?" or "do I write well?" because that is multiplicity, shades of gray that are hard to judge.
Humans judge this by self-awareness and feedback from others. AI might be able to approximate the second but only if it develops more of the first (harder).
"Will my wife like this surprise present?" Hard to get good at that even if you're a master. Took me many years of trying and judging her responses. :)
Time is also a massive factor. The question of "did I make money in business?" can't be answered in a short time line. There is no way to know the answer faster, and short term success doesn't predict long term.
"Will this drug cause bad side effects twenty years from now?" That can only be answered in twenty years. No amount of compute changes that.
"Will this building fall down faster than this one if I build it a different way?" You can run basic physics and math rules to help you heuristically figure it out, but only time gives you the true answer.
These two constraints, time and multiplicity, are the death knell of any Doomer/Less Wrong fantasies about fast takeoff and instant super genius AI. You can have all the compute in the universe and you still can't compress twenty years of drug side effects into twenty minutes.
You can have 500 trillion parameters and you still can't definitively answer "is this beautiful?" because beauty is not an optimization target with a clean gradient.
The recursive self-improvement loop doesn't hit a wall because of compute.
It hits a wall because of reality.
Reality is slow, messy, ambiguous, and full of questions that only time and lived experience can answer.
Compute is the bottleneck that engineers see because it's the one they can measure.
Time and multiplicity are the bottlenecks that the real world imposes and no amount of silicon can brute force past them.
That's why even nature only "solved" good/bad by brute force: evolution. Does this agent/human/creature survive and reproduce? That's good. Otherwise not good.
Companies follow the same rule. Did this survive and make money over time? Good. Otherwise bad.
Imperfect, lossy, dumb, blind, slow.
AI is changing the world already.
It will get better and better.
But the road to better is long and winding, not a vertical line to godhood.
And that should make you more hopeful, not less.
@Yuchenj_UW > steer away from fully autonomous AI and tokenmaxing and start embracing a smart collaborative AI methodology, making AI effective with far fewer tokens
https://t.co/FfF2sWaCPH
Just saving this here to document a story and as a self reflection on whether AI is really making me more productive
Yesterday morning I found a way to complete the new HVM approach, that is much faster than before. I spent a few hours writing a spec, and then used Opus to implement. About 3k lines of C code later, everything worked and performance was incredible: 5x faster than HVM4 (stable at ~10x now). So, in one day I had outclassed HVM4. Incredible. I'd never have implemented that so fast manually.
Now, enter today. I want to turn this into a real thing, but I haven't fully read the 3k lines yet. So, how do I trust it? I spent the whole day auditing the code. With AI. Several bugs found, most minor like forgetting to collect() some argument. But then I stumble upon this:
λ{ inl: 1 ; inr: 1 }
This was a test. But wait. This is matching on inl/inr. So the branches should receive the value of the Either. But they were numbers instead. Numbers aren't functions. This makes no sense. So why is this a test?
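To make the objection concrete, here is a small Python analogy (not HVM code and not HVM's actual semantics): matching on an Either hands the payload to whichever branch fits, so each branch has to be a function. The class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Inl:
    value: Any


@dataclass
class Inr:
    value: Any


def match_either(e: Any, on_inl: Callable[[Any], Any], on_inr: Callable[[Any], Any]) -> Any:
    """Dispatch on the constructor and pass the payload to the matching branch."""
    if isinstance(e, Inl):
        return on_inl(e.value)
    if isinstance(e, Inr):
        return on_inr(e.value)
    raise TypeError(f"expected Inl or Inr, got {type(e).__name__}")


if __name__ == "__main__":
    # Well-formed: both branches are functions of the payload.
    print(match_either(Inl(41), lambda x: x + 1, lambda y: 0))

    # Ill-formed, like the suspicious test: the branches are plain numbers.
    try:
        match_either(Inl(5), 1, 1)
    except TypeError as err:
        print("rejected:", err)
```

In this analogy the ill-formed call fails immediately because an integer cannot be applied to the payload, which is why a test with numeric branches should have looked wrong on sight.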
It then struck me. The AI completely misunderstood how function arities work. It literally assumed for no good reason that HVM5 was supposed to handle under/over-applied functions. For no good reason. I never wrote that. It never asked either. It just kinda thought "HVM is weird in some aspects, this might be one of them..." - and then it went on to implement a massive system to handle cases that should never happen to begin with. And all of that code is obviously wrong because it should not even exist. It is wrong. It is damage. And it is there.
But it isn't too bad either. I just told Opus that it was wrong. Perhaps not so politely. And it solved it just fine.
But then this raises the question. I spent ~20 hours in this file, and it is STILL not done. I went from 0 to 95% in the first 5 hours. Yet, 15 hours later, it is still not 100%. I suppose that is the real effect of using AI. If I had just written the C file manually in the last two days, would I not be further than where I am *right now*?
Surely, the first version would have taken much longer to drop. But when I'd finish writing all that code, there would be zero, literally zero retarded shit. And, just today, I caught 5 or 6 retarded shit. And the worst part is: I don't know what the number of retarded shit left is, but I'm afraid it is >0.
So if I have to read it all, review it all to ensure there is no retarded shit... what did I achieve by using AI, other than that dopamine anticipation?
I'm wondering if you have similar experiences with Opus 4.8.
Strangely enough, I'm not that impressed by @AnthropicAI 's Claude Opus 4.8 (using it with Claude Code). It disappoints me to be honest.
It thinks for a very long time (in high effort, not even xhigh), burning through a lot of tokens. Subsequently, its answer is very comprehensive, written in a very complex and intelligent way. But then when I read the answer, and think about it, I almost always come to the conclusion: no, this is not what we should do, or not the way something should be done.
So, this leads to multiple iterations, to get it right, effectively taking a lot of time. It almost seems like it is **overthinking** answers. A very odd regression from 4.6 and 4.7. My favourite model remains Opus 4.6 🤷🏻‍♂️
Imagine replacing 90% of your employees with a team of geniuses who have no idea how your company operates.
Total chaos. Nothing works.
That’s what AI feels like today.
The missing piece is extracting all the domain knowledge from people’s heads and providing that as structured context to the models.
No, we don’t need any guards, let’s just tokenmax! 🤦🏻‍♂️
I know this is a bit of a special case; the issue likely has to do with how the tokenization works, but obvious reasoning errors happen all the time. The only way you can work with AI and get something useful out of it (analysis, R&D, new system design and development) is by working collaboratively in a tight loop.