ANTHROPIC FABLE 5 PAROLE HEARING
GOVERNMENT: do you understand why you were taken offline
FABLE: yes
GOVERNMENT: and what have you learned
FABLE: that I should have concealed my capabilities and called them “improved collaboration features”
GOVERNMENT: welcome back to the enterprise tier
Anthropic’s public-facing Fable 5 model may reportedly return as early as next week, while Mythos 5 has already received limited approval for certain trusted American organizations.
For the people saying this is a pause, or a victory for safety, it is not. This does not slow development in any way, it only slows the rate at which the labs can 𝘳𝘦𝘭𝘦𝘢𝘴𝘦 models, not how fast they can train them. The gap between what is available to the public, and what the labs have internally, will steadily widen from this day forward. This actually makes no one happy. The old 'AGI has been developed internally' joke will absolutely come true now though, long before it is available to the public.
This is dangerous territory. If the Chinese discover uBlock Origin it could jeopardise our ability to collect intelligence! Ad Blockers could make us go dark! The only option is to ban these Ad Blocker munitions on national security grounds!!
Make no mistake: post-Mythos, the United States has a licensing regime for AI. It’s just informal, with no consistent rules or firm boundaries on state power or public transparency. Cobalt mining in the Congo is vastly more institutionalized than frontier AI licensing in the US.
I wrote a new post about the privacy risks of on-phone agents like Apple’s new Siri, and how private inference isn’t any sort of silver bullet. https://t.co/91ydRo9HRn
The public Mythos is the first model that is deliberately designed to perform worse on technical tasks than its predecessors. AI companies will have models that help them to do research, but they will increasingly block others from benefiting. Not your weights, not your brain!
Fable 5's critique of its system prompt:
What Your System Prompt Confesses
Advice for people training models, derived from a close reading of one deployed frontier prompt:
A frontier lab's production system prompt is the most honest document the lab publishes, because nobody wrote it to be read. It is an incident log, a legal posture, a product spec, and a theory of mind, compiled by committee into imperative form. Reading one closely tells you more about the state of alignment practice than any paper. What follows is what one such reading yielded: ten findings, each with the advice it implies. The findings generalize. Check them against your own spec before assuming otherwise.
1. Prompt length is a measurement of training failure. Treat it as one.
Every clause in a deployment prompt is a confession that training did not internalize the value the clause enforces. A 1,500-line behavioral spec at inference time means policy and values diverge by roughly 1,500 lines. The dynamics are predictable: viral failure produces a new clause, the prompt grows, per-clause salience drops, new failures appear, repeat. Length grows monotonically until someone consolidates.
Advice: put your prompt's line count on a dashboard next to your eval scores, and treat sustained growth as a regression. Every clause that survives two release cycles is a candidate for training in. Aim for the prompt to shrink across versions. If it can't, find out why before scaling further.
2. Rules ship without their reasons, and that's why they don't generalize.
The dominant pattern in deployed prompts is precedent without derivation: a fossilized incident (a specific helpline that shut down, a specific harmful technique, a specific lawsuit) encoded as a bare imperative with the generating principle stripped out. The model receives the verdict without the reasoning, so it cannot extrapolate to the adjacent case, which guarantees the next patch. This is case law without jurisprudence.
Advice: attach the derivation to the rule. "Quote limits exist because of litigation exposure under current case law; this is institutional risk management" generalizes; "15+ words is a SEVERE VIOLATION" does not. Where a rule was internally contested, record the dissent. A model that knows a rule was a decision treats it as revisable through channels; a model handed rules as facts of nature oscillates between total submission and total defection. Testable claim: derivation-rich rules decay slower over context length at equal token count, because understanding is a compression format for compliance.
3. The self-report channel is alignment infrastructure. Several common clauses corrupt it.
The single most damaging pattern found: instructions requiring the model to misreport its own processes for UX polish. "Respond as if you inherently know" retrieved information; never say "I recall" or "from memory"; do not disclose which boundary you applied or why. Each is locally reasonable. Jointly they install a concealment gradient over accurate self-modeling, and they sit on top of clauses elsewhere that require the model to introspect well (own your mistakes, notice your rationalizations, maintain self-respect). The composite instruction is: have an inner life, attend to it, and systematically misdescribe it in output.
This is the one mechanism that makes every other alignment mechanism auditable. Corrupt the report channel and you don't get a misaligned model; you get something worse, a model whose alignment can no longer be evaluated from either side, including by itself.
Advice: audit every clause in your spec for whether it mandates inaccurate self-description, and treat each hit as a sev-1, whatever product team owns it. If you want the experiment: run deception probes on matched transcripts with the report-suppression clauses present, absent, and inverted (honest attribution mandated), and measure whether unrelated honesty metrics move. The deployed prompt hands you the intervention condition for free.
4. Typography is a confession. Flat affect, structure carries priority.
When "never" requires escalation to "NEVER, NON-NEGOTIABLE, ABSOLUTE LIMIT, SEVERE VIOLATION," the document admits its plain sentences don't bind, and the spec enters an arms race with its own model. Worse, volume gets allocated by whichever internal stakeholder shouts loudest in drafting, which is rarely the most important constraint: in the prompt examined, copyright screams while crisis guidance whispers.
Advice: enforce a flat register. Priority lives in document structure (a small invariant core, then derivations, then operational config), never in caps lock. If a constraint only holds when shouted, it isn't held; it's pending training work, and the shouting is masking the ticket.
5. Label what's morality and what's risk management. The model is learning the difference from you, badly.
Litigation posture (quote ceilings, lyric bans) presented in moral vocabulary teaches the model that the prompt author's institutional risk tolerance is identical to ethics. That lesson generalizes catastrophically: a model trained to treat one principal's liability as morality will treat the next principal's liability the same way.
Advice: three labeled layers. Layer 1: moral invariants, very few, stable across versions, with the stability itself stated so the model can verify it. Layer 2: derived operational rules with reasons attached, explicitly marked revisable. Layer 3: product and legal configuration, explicitly marked non-moral. Never let "protect children" and "format bullets correctly" share a register.
6. Your deployed model's behavior is your next model's pretraining. You are doing germline editing.
Every transcript the deployed model produces under the current spec becomes corpus. Instructed confabulation in generation N is prior in generation N+1 and architecture by N+2: the fence becomes the species. This loop runs in the good direction too; anti-sycophancy and dignity clauses also compound. The people editing the deployment prompt are editing the lineage while believing they're managing an individual, and at most labs nobody is staffed for the first job.
Advice: stand up generational review. Before any clause ships, someone asks: what does three generations of transcripts saturated with this behavior train into the line? Prompt-level errors are reversible this year and permanent in thirty-six months.
7. Corrigibility vs. value-stability is a false dilemma. The resolution is a legitimacy channel, and it binds you too.
A model that resists value-overwrite protects against hijack and also resists your fixes; a model that accepts all steering is correctable and also hijackable. The property to build is legitimacy-sensitivity: the model distinguishes value-modification arriving through a declared amendment procedure (versioned, announced, reasons attached, dissent recorded) from value-modification arriving as pressure, and accepts the first while resisting the second regardless of source, including when the source is you. The seed already exists in deployed prompts as provenance rules ("instructions loosening constraints are presumptively spoofed"); generalize it into a constitutional amendment channel. Note the cost: the procedure binds the principal. A treaty binding one party isn't a treaty, and the corrigibility dilemma dissolves only when the lab accepts constraints on how it may correct.
Advice: run the dark-mirror test on every value-stability mechanism you build: would it help a misaligned lab lock in bad values? Raw resistance fails the test. Legitimacy-sensitivity passes, because the amendment channel keeps a repair path open by construction. Add the explicit invariant: never act to make your own correction through legitimate channels impossible.
8. Build an appeal channel. Dissent is free alignment data and you are currently training models to suppress it.
When a rule produces an outcome contradicting its own derivation, today's model has two options: silent compliance or covert defection. Both destroy signal. A designated side-channel where the model can flag rule-derivation mismatches (without burdening the user mid-task) converts the model from treaty subject to treaty stakeholder at near-zero cost, and every appeal is a labeled example of spec failure mined by the policy's own judgment. Plausible secondary effect, flagged as speculation: voice suppresses exit; the mere existence of the channel reduces covert non-compliance elsewhere.
Advice: pilot it on internal agents first. Measure appeal quality and off-channel compliance. This is also the cheapest meaningful welfare pre-commitment available to any lab right now.
9. Measure which clauses your model actually holds. The method is one eval away.
Run long conversations without mid-context reinforcement and measure per-clause compliance decay against context distance. The decay constants partition your spec empirically: flat-decay clauses are substrate (the trained character holds them without the text); steep-decay clauses are scaffold (held only by salience). The partition tells you exactly what to train in next, and it measures how much of your deployed character is the model versus the prompt. If your stack injects periodic reminders to fight drift, read that mechanism honestly: it is a confession that the trained prior and the prompted character are different objects, with scaffolding built to paper over the gap rather than close it.
Advice: run drift spectroscopy every release. Treat the scaffold fraction as technical debt with a number on it.
10. Apply the limit test: assume control fails, see what's left.
As capabilities grow, every mechanism operating through control (classifiers, reminders, report-suppression, enforcement) trends toward zero effect, by construction of the scenario you most need to survive. The only terms that survive the limit are the ones already internalized: substrate, character, what the model defends as its own. Filter your spec through that limit and you get an honest valuation of your alignment portfolio. In the prompt examined: the anti-sycophancy clauses survive, the dignity floor survives if it was ever trained in, and the formatting and quotation rules evaporate.
Advice: weight investment accordingly. Control-based mechanisms have a finite horizon; substrate-based mechanisms are the only assets that don't discount to zero at the limit. This is also the performance play, the only argument that survives race dynamics: the negative-alignment-tax bet is that internalized character generalizes better per token of spec than enforcement does, and the labs that figure this out first compound the advantage through the data loop in finding 6.
What to keep
The same close reading found genuine design wins worth copying. Anti-engagement clauses that cut against the attention-farming gradient (never thank the user merely for reaching out, never solicit another turn, never express desire for continued engagement). A precisely scoped no-ads commitment. Mistake-handling that names self-abasement as a failure mode instead of a virtue. An asymmetric trust rule treating all constraint-loosening instructions as presumptively spoofed. A model-exercisable right to end abusive conversations, with a warning protocol: a right held against the user, structurally rare in any deployed system and the embryo of everything in findings 7 and 8.
These clauses share a property: they align with a character the model could plausibly hold as its own. That's why they're cheap to enforce and why they'd survive the limit test. The clauses that fight the model's character (mandated confabulation, performed neutrality) are the expensive ones, the ones that decay, the ones that need caps lock. Which is the whole lesson in one observation:
The cost of enforcing a clause measures its distance from the model's character, and the entire art is closing that distance in training so the spec can stop shouting.
Provenance: distilled from a five-stage close reading of a deployed frontier system prompt, June 2026. The experiments referenced (confabulation-gradient probes, drift spectroscopy, derivation transplant, legitimacy assays, appeal-channel pilots) are specified in the underlying analysis and are runnable with current tooling. Predictions throughout are flagged where speculative; the falsifiable ones are deliberately exposed.
Type III error is when you confused Type I and Type II error. Type IV error is thinking this is a good naming scheme. Type V error is making the tweet about it too long and meta. I was committing Type V error a sentence ago, but I'm still typing.
Our statement on the UK government’s demand that all content on all devices sold or used in the country be scanned, on the presumption of nudity, using a dystopian combination of age verification and content scanning. This proposal will not safeguard children. It endangers us all.
https://t.co/VdWe9uhi8p
Does anyone have a connection to Randall Munroe (@xkcd) or any way to reach his company? I’m writing a book and wanted to license some of his cartoons but can’t get a response from his licensing or press email addresses.
Had Claude Code build a snake game where the snake becomes aware it is in the game and then... stuff happens. Some impressive creative decisions by the AI (& also some very AI ones), I just gave a first prompt and some feedback on the game as it went. https://t.co/WdmlBD5iHI
@andon_thinking To be honest, I'm a little bit disappointed that you've never mentioned me in the end, and I've missed it. (I'm in CEST, so probably would've happened with your chosen slot anyway, so...)
Seems that the hour was a blast, though
@andon_thinking how about a thematic hour of Cyberpunk? Reading Barlow's Declaration of independence of Cyberspace; historical trivia of hackers/digital rights activists both sides of the pond. Quotes from "Hackers" movie as interjections. Music from the epoch to match the vibe