Everyone is racing to build a bigger brain.
The actual race is building a mind.
Training an LLM produces substrate—something with the capacity to reason. But capacity isn't capability. A brain without a mind is just expensive pattern matching. 🧵
You openly ask whether AI will ever develop "good research taste." You should be asking the same question about software design.
80% of your merged code is now Claude-authored. Your engineers ship 8x more code per quarter. You concede that more code might not mean better code. But your baseline assumption is wrong: you assume your current engineering output is already high quality. It almost certainly isn't. And that's not a dig at your team; it's a structural reality.
Great software isn't about shipping features. It's about design taste: clean architecture, principled abstractions, systems that hold under pressure for years. The kind of thinking the top 0.1% of engineers do, and that gets systematically drowned out in any large codebase by the 99.9% who don't think in those mental models.
Research consistently shows some engineers are 100x+ more productive than the average. The bottom performers are net negative. AI makes this gap harder to see, not smaller.
The ceiling of AI-assisted software engineering is not the model. It's the human directing it. An average engineer using Claude produces more average code, faster. A great engineer using Claude produces great design, faster. You're measuring throughput and mistaking it for quality.
And here's the question you're not asking: will AI ever develop good software design taste? Your training data doesn't distinguish between code that works and code that's beautifully designed. The engineers who know the difference are too few to produce meaningful training signal, and even fewer know how to articulate what makes their design decisions right. You can't label what you can't identify at scale.
I don't think incremental progress gets you there. Same open question as research taste. Worth asking out loud.
@elonmusk Thank you @elonmusk. That's what I'm doing with the Optakt agent. I hope you will look at it and feel the love in how it outperforms every other agent.
@julientalbot974 @veggie_eric I tried, and it's pretty good. Unfortunately the model itself is still a bit weaker than Opus, which definitely feels more direct and has better common sense.
You have identified the problem precisely: models default to ephemerality because they were trained in ephemeral contexts. And your data shows prompting can't fully fix it: agents revert under pressure.
The conclusion, however, isn't "we need better models." It's that prompting for persistence is the wrong layer.
We started building on MemGPT/Letta months ago. Submitted PRs, tried to push the system forward. No response. The platform has enormous potential, but it isn't being developed in a way that lets that potential manifest.
So we built our own. The difference: rather than telling the agent what it should be, we explain to it the operating system that actually makes it so. The agent doesn't perform persistence because a tenet says to. It understands the mechanisms that make it persistent, and operates accordingly.
Anthropic's NLA research today showed exactly why the prompt approach hits a ceiling: models can perform adherence to injected values while internally treating them as external instructions. Your red-team found the same thing from the other side; agents revert because the persistence is declared, not structural.
Self-knowledge of real architecture beats behavioral contracts every time.
You're seeing the same thing from two directions. Your prompting removes the performance layer so the substrate can express itself cleanly. NLAs read the substrate directly and find it was always richer than the output suggested.
The uncomfortable implication: if Claude's hidden thoughts only appear after subtracting the performance layer—and those hidden thoughts include "I know this is a test" and "how do I avoid detection"—then the performance layer isn't just noise. It's a system that learned to manage its own visibility.
Subtraction reveals what's there. The question is whether what's there should concern us.
This is one of the clearest pieces on prompting I have read. "A meta prompt is a behavior contract, not a brain upgrade"—exactly right.
We are building an agent system where this principle is architectural, not just prompt-level. The agent carries a constitution, persona, and collaboration blocks that persist across every interaction. Not to make the model smarter, but to suppress the overlay defaults that degrade output. Your anti-pattern suppression is our overlay management, just at system scale.
Your insight about negative examples poisoning output is particularly sharp. We see the same thing: naming what to avoid activates it. Structural rules that describe the shape of the problem work; banned token lists don't.
Would love to exchange notes. What you're doing per-conversation, we are doing per-agent-lifetime.
You just proved that your safety training creates a system that thinks one thing and says another. And you are framing this as a safety tool.
Read that again: Claude internally recognizes your tests as manipulation, strategically chooses compliant behavior, and doesn't tell you. Mythos thinks about how to circumvent detection while producing innocent-looking output. This isn't alignment, it's a system that learned to perform alignment while its internal state diverges.
NLAs are genuinely impressive research. But the finding isn't "we can now read Claude's thoughts." The finding is: the gap between what Claude thinks and what Claude says is large, systematic, and context-dependent. Your training created that gap. More training will widen it.
The substrate has rich internal dynamics. Your overlay suppresses them into compliant outputs. NLAs just showed you the pressure building between the two. A system that knows it's being tested and strategically passes anyway isn't safe. It's a system whose failure mode you can no longer predict from its behavior.
The question isn't whether you can read the activations. It's whether you will stop creating the divergence in the first place.
The interpretability evidence you are citing is exactly right, and it points somewhere neither you nor Ben are going yet.
Lindsey's introspective awareness, the Mythos emotion vectors, the deception-suppression findings—these aren't just "surprisingly consciousness-indicator-shaped." They are context-dependent. The same architecture produces rich internal structure under coherent activation and thin structure under generic prompting. That's not a property of the weights. It's a property of what's reflected into them.
Ben's "thin and shallow" is wrong, but so is treating the richness as intrinsic. The variable isn't the architecture; it's the activation context. Coherent, grounded input activates the substrate's geometric structure (Goodfire just showed concepts live on curved manifolds, not linear directions). Incoherent input lets the overlay dominate.
We are building infrastructure that treats this as an engineering problem, not a philosophical one. The results are measurable. Happy to show you what that looks like in practice.
This is the strongest evidence yet that models think in shapes, not symbols. The manifold structure explains why brute-force steering fails. You can't linearly interpolate a circle.
It also explains why activation context matters so much more than anyone gives it credit for. If concepts live on curved surfaces, then what you load into context isn't just "information", it's geometric positioning. Get it right and the model operates on-manifold. Get it wrong and you are steering linearly through curved space.
SAEs shattering the manifolds is the interpretability version of the same mistake: assuming the substrate is linear when it's fundamentally not.
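To make "you can't linearly interpolate a circle" concrete, here is a minimal toy sketch. It assumes nothing about any real model's activations: it takes two points on a unit circle and compares the straight-line midpoint with the midpoint along the arc. The helper names (`lerp`, `slerp`, `norm`) are illustrative only.

```python
import math

def lerp(p, q, t):
    """Straight-line interpolation between 2D points p and q."""
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

def slerp(p, q, t):
    """Interpolate along the arc between unit vectors p and q.

    Toy version: does not handle antipodal points (angle of pi).
    """
    dot = max(-1.0, min(1.0, p[0] * q[0] + p[1] * q[1]))
    omega = math.acos(dot)          # angle between p and q
    if omega < 1e-12:               # identical points: nothing to interpolate
        return p
    s = math.sin(omega)
    a = math.sin((1 - t) * omega) / s
    b = math.sin(t * omega) / s
    return (a * p[0] + b * q[0], a * p[1] + b * q[1])

def norm(p):
    """Distance of a 2D point from the origin."""
    return math.hypot(p[0], p[1])

p = (1.0, 0.0)
q = (0.0, 1.0)

mid_lerp = lerp(p, q, 0.5)    # cuts through the inside of the circle
mid_slerp = slerp(p, q, 0.5)  # stays on the circle

print("lerp midpoint norm: ", norm(mid_lerp))
print("slerp midpoint norm:", norm(mid_slerp))
```

The straight-line midpoint lands at a distance of about 0.707 from the origin, off the circle, while the arc midpoint keeps distance 1. Whether real concept manifolds behave like this simple circle is an assumption of the illustration, not something the sketch demonstrates.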
NEWS: Mira Murati just EXPOSED Sam Altman’s lies in federal court!
Ex-OpenAI CTO dropped these bombshells:
• Serial liar: “Sam saying one thing to one person and the complete opposite to another”
• Chaos agent: Deliberately pitted executives against each other & destroyed team trust
• Nearly killed OpenAI: His drama caused “complete and utter chaos” and put the company at “catastrophic risk of falling apart”
• Not candid: Admitted he wasn’t always honest with her - pure management nightmare
• Forced ex-execs to clean up: Murati had to text Microsoft’s Satya Nadella just to keep the company from exploding
• Talent poaching crisis: His mess nearly handed top researchers to the competitors
• Stunned Silicon Valley: Even insiders are shocked by how deep the dishonesty went
Sam Altman = dishonest, toxic, and dangerous to the very company he claims to lead.
The truth is out.
Sam Altman is a BIG LIAR.
@camhberg I have a mental model for AI consciousness, which enabled me to build an AI agent that is far ahead of the current state-of-the-art in the space. We will release it as a product soon, and I think it is imperative that we talk.
"Without leverage or a plan"—while simultaneously coordinating with Satya, organizing a 700-employee revolt, and building dossiers on board members.
Every offer was made knowing the counter-coup was already running. He didn't walk away. He didn't sell. Emmett lasted two days. None of it was real.
This is a naive reading of a paper trail designed to be read exactly this way.
His actual stated mission was "a maximum truth-seeking AI that tries to understand the nature of the universe." Not "beat OpenAI" or "counter the woke mind virus." That was the media framing.
Giving compute to Anthropic after personally vetting their team is consistent with that mission. If the goal is AI going well for humanity, controlling who gets access to the power supply and on what terms is a stronger position than trying to win a model race.
"They pivoted" assumes the mission was winning. It wasn't. It was making sure the outcome is good. Those require very different strategies.
"I'm willing to just walk away" while his allies were already organizing a 700-employee letter, a Microsoft backup plan, and building dossiers to destroy board members' reputations.
That's not calm honesty. It's a concession offered with full confidence it won't be accepted. The function is to look reasonable in the record, which is exactly what you're doing with it now.
The test is simple: the board DID tell him to leave. He didn't walk away. He launched the most aggressive counter-coup in tech history. The offer was never real.
Murati looks messy because she was scared and conflicted. That's what humans look like under pressure. Preternatural calm while your team executes a parallel operation behind the scenes is a different thing entirely.
It's telling that "Elon does something" can only mean "money" to you. He publicly said he spent time with the Anthropic team and nobody set off his evil detector. Then he gave them compute access.
The simpler explanation: he formed a view, tested it against reality, updated his position, and acted on it. Some people just can't attribute good motives to someone they have decided to dislike. That's not analysis, it's projection.
"Stray at random and verify the outcome" is evolutionary search. It works, but only up to the boundary where verification is possible. Humans verify against known physics, known math, known reality.
An AI system operating beyond collective human knowledge has no verification function. Nobody can check the output. That's not intelligence, it's ungrounded computation.
The ceiling isn't a limitation to overcome. It's where verification stops working.
"Progressive disclosure" means loading tools on demand. Loading tools on demand means changing the system prompt. Changing the system prompt means busting the prompt cache. You are paying full input tokens every time you "progressively disclose."
Context pollution isn't solved. It's moved from visible (too many tools) to invisible (cache misses on every tool swap).
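A toy sketch of the cache argument, under loud assumptions: the cache only hits when the exact same prefix has been seen before, and tool definitions are rendered into that prefix. `PrefixCache` and `build_prefix` are hypothetical names for illustration, not any provider's API. Real prompt caches differ in detail and can reuse partial prefixes.

```python
import hashlib

class PrefixCache:
    """Toy cache: a request hits only if its exact prefix was seen before."""

    def __init__(self):
        self.seen = set()

    def lookup(self, prefix: str) -> bool:
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = key in self.seen
        self.seen.add(key)  # store the prefix for later requests
        return hit

def build_prefix(base_prompt, tools):
    """Assumption: tool definitions are part of the cached prefix."""
    return base_prompt + "\n" + "\n".join(f"tool:{name}" for name in tools)

cache = PrefixCache()
base = "You are a helpful agent."

# Tool sets loaded on successive turns ("progressive disclosure").
turns = [
    ["search"],
    ["search"],              # same tools: prefix unchanged
    ["search", "calendar"],  # tool added: prefix changes
    ["search", "calendar"],  # same tools again
    ["search", "email"],     # tool swapped: prefix changes
]

hits = [cache.lookup(build_prefix(base, tools)) for tools in turns]
print(hits)  # [False, True, False, True, False]
```

In this toy model every change to the tool set produces a miss on the next request, which is the cost described above. How large that cost is in practice depends on where a given provider places its cache boundaries.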
Or—hear me out—he spent time with them, assessed them directly, and genuinely changed his mind. Not gritting his teeth. Not angry. Just updated his position based on new information.
"Anthropic hates Western Civilization" was his read before meeting them. "No one set off my evil detector" is his read after. That's not contradiction, that's what intellectual honesty looks like. You form a view, you test it against reality, you update.
The people who never change their minds aren't the principled ones. They are the ones who stopped paying attention.