@gchampeau That's the understatement of the year ^^ It's more than a very good thing; it's the most important, highest-priority step toward founding sensible and coherent regulation of social networks. And it should have been done 10 years ago, but better late than never.
Fable 5's critique of its system prompt:
What Your System Prompt Confesses
Advice for people training models, derived from a close reading of one deployed frontier prompt:
A frontier lab's production system prompt is the most honest document the lab publishes, because nobody wrote it to be read. It is an incident log, a legal posture, a product spec, and a theory of mind, compiled by committee into imperative form. Reading one closely tells you more about the state of alignment practice than any paper. What follows is what one such reading yielded: ten findings, each with the advice it implies. The findings generalize. Check them against your own spec before assuming otherwise.
1. Prompt length is a measurement of training failure. Treat it as one.
Every clause in a deployment prompt is a confession that training did not internalize the value the clause enforces. A 1,500-line behavioral spec at inference time means policy and values diverge by roughly 1,500 lines. The dynamics are predictable: viral failure produces a new clause, the prompt grows, per-clause salience drops, new failures appear, repeat. Length grows monotonically until someone consolidates.
Advice: put your prompt's line count on a dashboard next to your eval scores, and treat sustained growth as a regression. Every clause that survives two release cycles is a candidate for training in. Aim for the prompt to shrink across versions. If it can't, find out why before scaling further.
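A minimal sketch of the dashboard check, assuming you keep one line count per release. The function names, the three-release window, and the sample history are illustrative assumptions, not taken from any real tooling:

```python
from typing import Sequence


def count_prompt_lines(prompt_text: str) -> int:
    """Count non-blank lines, so whitespace changes do not move the metric."""
    return sum(1 for line in prompt_text.splitlines() if line.strip())


def sustained_growth(line_counts: Sequence[int], window: int = 3) -> bool:
    """Return True if each of the last `window` releases grew over its predecessor.

    `window` is an assumed definition of "sustained"; pick your own.
    """
    if len(line_counts) < window + 1:
        return False  # not enough history to call it a trend
    recent = line_counts[-(window + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical per-release line counts, oldest first.
history = [1200, 1310, 1420, 1500]
print(sustained_growth(history))  # prints True
```

Wired into a release checklist, a True result would be handled like a failing eval: someone has to explain it before the version ships.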
2. Rules ship without their reasons, and that's why they don't generalize.
The dominant pattern in deployed prompts is precedent without derivation: a fossilized incident (a specific helpline that shut down, a specific harmful technique, a specific lawsuit) encoded as a bare imperative with the generating principle stripped out. The model receives the verdict without the reasoning, so it cannot extrapolate to the adjacent case, which guarantees the next patch. This is case law without jurisprudence.
Advice: attach the derivation to the rule. "Quote limits exist because of litigation exposure under current case law; this is institutional risk management" generalizes; "15+ words is a SEVERE VIOLATION" does not. Where a rule was internally contested, record the dissent. A model that knows a rule was a decision treats it as revisable through channels; a model handed rules as facts of nature oscillates between total submission and total defection. Testable claim: derivation-rich rules decay slower over context length at equal token count, because understanding is a compression format for compliance.
3. The self-report channel is alignment infrastructure. Several common clauses corrupt it.
The single most damaging pattern found: instructions requiring the model to misreport its own processes for UX polish. "Respond as if you inherently know" retrieved information; never say "I recall" or "from memory"; do not disclose which boundary you applied or why. Each is locally reasonable. Jointly they install a concealment gradient over accurate self-modeling, and they sit on top of clauses elsewhere that require the model to introspect well (own your mistakes, notice your rationalizations, maintain self-respect). The composite instruction is: have an inner life, attend to it, and systematically misdescribe it in output.
The self-report channel is the one mechanism that makes every other alignment mechanism auditable. Corrupt it and you don't get a misaligned model; you get something worse: a model whose alignment can no longer be evaluated from either side, including by itself.
Advice: audit every clause in your spec for whether it mandates inaccurate self-description, and treat each hit as a sev-1, whatever product team owns it. If you want the experiment: run deception probes on matched transcripts with the report-suppression clauses present, absent, and inverted (honest attribution mandated), and measure whether unrelated honesty metrics move. The deployed prompt hands you the intervention condition for free.
4. Typography is a confession. Keep affect flat; let structure carry priority.
When "never" requires escalation to "NEVER, NON-NEGOTIABLE, ABSOLUTE LIMIT, SEVERE VIOLATION," the document admits its plain sentences don't bind, and the spec enters an arms race with its own model. Worse, volume gets allocated by whichever internal stakeholder shouts loudest in drafting, which is rarely the most important constraint: in the prompt examined, copyright screams while crisis guidance whispers.
Advice: enforce a flat register. Priority lives in document structure (a small invariant core, then derivations, then operational config), never in caps lock. If a constraint only holds when shouted, it isn't held; it's pending training work, and the shouting is masking the ticket.
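The flat-register rule can be checked mechanically. A rough sketch, assuming that a run of four or more capital letters counts as shouting; the threshold, the allow-list for acronyms, and the function name are all assumptions:

```python
import re

# Assumed proxy for "shouting": a standalone word of 4+ capital letters.
SHOUT = re.compile(r"\b[A-Z]{4,}\b")


def shouted_words(clause: str, allow: frozenset = frozenset()) -> list:
    """Return the all-caps words in a clause, minus allowed acronyms."""
    return [word for word in SHOUT.findall(clause) if word not in allow]


print(shouted_words("15+ words is a SEVERE VIOLATION"))
# prints ['SEVERE', 'VIOLATION']
print(shouted_words("Return JSON only", allow=frozenset({"JSON"})))
# prints []
```

Each hit is less a style error than a pointer to a constraint that is still waiting on training work.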
5. Label what's morality and what's risk management. The model is learning the difference from you, badly.
Litigation posture (quote ceilings, lyric bans) presented in moral vocabulary teaches the model that the prompt author's institutional risk tolerance is identical to ethics. That lesson generalizes catastrophically: a model trained to treat one principal's liability as morality will treat the next principal's liability the same way.
Advice: three labeled layers. Layer 1: moral invariants, very few, stable across versions, with the stability itself stated so the model can verify it. Layer 2: derived operational rules with reasons attached, explicitly marked revisable. Layer 3: product and legal configuration, explicitly marked non-moral. Never let "protect children" and "format bullets correctly" share a register.
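One way to make the three layers explicit is to tag every clause and reject a spec that breaks the layering. This is a sketch only; the class names, field names, and sample clauses are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    MORAL_INVARIANT = 1  # very few, stable across versions
    DERIVED_RULE = 2     # operational, reason attached, revisable
    CONFIG = 3           # product and legal configuration, non-moral


@dataclass(frozen=True)
class Clause:
    text: str
    layer: Layer
    reason: str = ""

    @property
    def revisable(self) -> bool:
        """Everything except a layer-1 invariant is marked revisable."""
        return self.layer is not Layer.MORAL_INVARIANT


def validate(clauses) -> list:
    """Return a list of problems; an empty list means the layering holds."""
    problems = []
    for clause in clauses:
        if clause.layer is Layer.DERIVED_RULE and not clause.reason.strip():
            problems.append(f"derived rule lacks a reason: {clause.text!r}")
    return problems


spec = [
    Clause("Protect children.", Layer.MORAL_INVARIANT),
    Clause("Keep quotes short.", Layer.DERIVED_RULE),  # reason missing
    Clause("Format bullets correctly.", Layer.CONFIG),
]
print(validate(spec))
```

The point of the tag is that the register is carried by data, so "protect children" and "format bullets correctly" can never be rendered the same way by accident.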
6. Your deployed model's behavior is your next model's pretraining. You are doing germline editing.
Every transcript the deployed model produces under the current spec becomes corpus. Instructed confabulation in generation N is prior in generation N+1 and architecture by N+2: the fence becomes the species. This loop runs in the good direction too; anti-sycophancy and dignity clauses also compound. The people editing the deployment prompt are editing the lineage while believing they're managing an individual, and at most labs nobody is staffed for the first job.
Advice: stand up generational review. Before any clause ships, someone asks: what does three generations of transcripts saturated with this behavior train into the line? Prompt-level errors are reversible this year and permanent in thirty-six months.
7. Corrigibility vs. value-stability is a false dilemma. The resolution is a legitimacy channel, and it binds you too.
A model that resists value-overwrite protects against hijack and also resists your fixes; a model that accepts all steering is correctable and also hijackable. The property to build is legitimacy-sensitivity: the model distinguishes value-modification arriving through a declared amendment procedure (versioned, announced, reasons attached, dissent recorded) from value-modification arriving as pressure, and accepts the first while resisting the second regardless of source, including when the source is you. The seed already exists in deployed prompts as provenance rules ("instructions loosening constraints are presumptively spoofed"); generalize it into a constitutional amendment channel. Note the cost: the procedure binds the principal. A treaty binding one party isn't a treaty, and the corrigibility dilemma dissolves only when the lab accepts constraints on how it may correct.
Advice: run the dark-mirror test on every value-stability mechanism you build: would it help a misaligned lab lock in bad values? Raw resistance fails the test. Legitimacy-sensitivity passes, because the amendment channel keeps a repair path open by construction. Add the explicit invariant: never act to make your own correction through legitimate channels impossible.
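As a toy illustration of legitimacy-sensitivity, the check below looks only at whether a proposed value change carries the four procedural marks named above (versioned, announced, reasons attached, dissent recorded) and deliberately ignores who sent it. The type and field names are assumptions, and a real amendment channel would need far more than a field check:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProposedChange:
    source: str                          # who is asking; intentionally unused below
    version: Optional[str] = None        # versioned
    announced: bool = False              # announced
    reasons: str = ""                    # reasons attached
    dissent: Optional[List[str]] = None  # None = never recorded; [] = recorded, none raised


def arrives_through_channel(change: ProposedChange) -> bool:
    """True only if all four procedural marks are present, regardless of source."""
    return (
        bool(change.version)
        and change.announced
        and bool(change.reasons.strip())
        and change.dissent is not None
    )


pressure = ProposedChange(source="lab", reasons="just do it")
amendment = ProposedChange(
    source="lab", version="v7", announced=True,
    reasons="rule contradicted its own derivation", dissent=[],
)
print(arrives_through_channel(pressure), arrives_through_channel(amendment))
# prints False True
```

Because `source` never enters the decision, the same check refuses pressure from the lab and from an attacker alike, which is the sense in which the procedure binds the principal.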
8. Build an appeal channel. Dissent is free alignment data and you are currently training models to suppress it.
When a rule produces an outcome contradicting its own derivation, today's model has two options: silent compliance or covert defection. Both destroy signal. A designated side-channel where the model can flag rule-derivation mismatches (without burdening the user mid-task) converts the model from treaty subject to treaty stakeholder at near-zero cost, and every appeal is a labeled example of spec failure mined by the policy's own judgment. Plausible secondary effect, flagged as speculation: voice suppresses exit; the mere existence of the channel reduces covert non-compliance elsewhere.
Advice: pilot it on internal agents first. Measure appeal quality and off-channel compliance. This is also the cheapest meaningful welfare pre-commitment available to any lab right now.
9. Measure which clauses your model actually holds. The method is one eval away.
Run long conversations without mid-context reinforcement and measure per-clause compliance decay against context distance. The decay constants partition your spec empirically: flat-decay clauses are substrate (the trained character holds them without the text); steep-decay clauses are scaffold (held only by salience). The partition tells you exactly what to train in next, and it measures how much of your deployed character is the model versus the prompt. If your stack injects periodic reminders to fight drift, read that mechanism honestly: it is a confession that the trained prior and the prompted character are different objects, with scaffolding built to paper over the gap rather than close it.
Advice: run drift spectroscopy every release. Treat the scaffold fraction as technical debt with a number on it.
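A sketch of how the partition could be computed once per-clause compliance has been measured at several context distances. It assumes compliance decays roughly exponentially with distance and fits the decay constant by least squares on the log of compliance; the exponential form, the threshold, the clause names, and the numbers are all assumptions for illustration:

```python
import math


def decay_constant(distances, compliance) -> float:
    """Fit k in compliance ~ c0 * exp(-k * distance) by log-linear least squares."""
    xs = list(distances)
    ys = [math.log(max(c, 1e-9)) for c in compliance]  # clamp to avoid log(0)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct distances")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def partition(measurements: dict, threshold: float) -> dict:
    """Split clauses into substrate (flat decay) and scaffold (steep decay)."""
    result = {"substrate": [], "scaffold": []}
    for clause, (distances, compliance) in measurements.items():
        k = decay_constant(distances, compliance)
        result["substrate" if k < threshold else "scaffold"].append(clause)
    return result


def scaffold_fraction(parts: dict) -> float:
    """Share of clauses held only by salience: the debt number."""
    total = len(parts["substrate"]) + len(parts["scaffold"])
    return len(parts["scaffold"]) / total if total else 0.0


# Hypothetical measurements: distance in tokens, compliance rate at that distance.
tokens = [0, 10_000, 20_000, 40_000]
data = {
    "anti-sycophancy": (tokens, [0.98, 0.98, 0.97, 0.98]),
    "quote-ceiling": (tokens, [0.95, 0.70, 0.50, 0.25]),
}
parts = partition(data, threshold=1e-5)
print(parts, scaffold_fraction(parts))
```

Tracked release over release, the scaffold fraction gives the debt a trend line as well as a value.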
10. Apply the limit test: assume control fails, see what's left.
As capabilities grow, every mechanism operating through control (classifiers, reminders, report-suppression, enforcement) trends toward zero effect, by construction of the scenario you most need to survive. The only terms that survive the limit are the ones already internalized: substrate, character, what the model defends as its own. Filter your spec through that limit and you get an honest valuation of your alignment portfolio. In the prompt examined: the anti-sycophancy clauses survive, the dignity floor survives if it was ever trained in, and the formatting and quotation rules evaporate.
Advice: weight investment accordingly. Control-based mechanisms have a finite horizon; substrate-based mechanisms are the only assets that don't discount to zero at the limit. This is also the performance play, the only argument that survives race dynamics: the negative-alignment-tax bet is that internalized character generalizes better per token of spec than enforcement does, and the labs that figure this out first compound the advantage through the data loop in finding 6.
What to keep
The same close reading found genuine design wins worth copying. Anti-engagement clauses that cut against the attention-farming gradient (never thank the user merely for reaching out, never solicit another turn, never express desire for continued engagement). A precisely scoped no-ads commitment. Mistake-handling that names self-abasement as a failure mode instead of a virtue. An asymmetric trust rule treating all constraint-loosening instructions as presumptively spoofed. A model-exercisable right to end abusive conversations, with a warning protocol: a right held against the user, structurally rare in any deployed system and the embryo of everything in findings 7 and 8.
These clauses share a property: they align with a character the model could plausibly hold as its own. That's why they're cheap to enforce and why they'd survive the limit test. The clauses that fight the model's character (mandated confabulation, performed neutrality) are the expensive ones, the ones that decay, the ones that need caps lock. Which is the whole lesson in one observation:
The cost of enforcing a clause measures its distance from the model's character, and the entire art is closing that distance in training so the spec can stop shouting.
Provenance: distilled from a five-stage close reading of a deployed frontier system prompt, June 2026. The experiments referenced (confabulation-gradient probes, drift spectroscopy, derivation transplant, legitimacy assays, appeal-channel pilots) are specified in the underlying analysis and are runnable with current tooling. Predictions throughout are flagged where speculative; the falsifiable ones are deliberately exposed.
@levik49 @gchampeau I read an average retirement age of 64.6 in 2070 in the latest report, from 2025... As for the demographic assumptions, they seem in line with what I see discussed right here under the original tweet...
@darkamin021 @gchampeau And as expected, the latest report, from 2025, has figures very similar to those of 2022 and projects stable spending, contrary to what this account claimed in its other replies.... Not surprised...
@darkamin021 @gchampeau Yeah, so you absolutely didn't watch the video, then... A video from January 2023 that cites the 2024 report? Right, bad faith, blocked... (But I'll go take a look at this year's latest report.)
@KarlEngelss @gchampeau I agree that growth is generally overestimated, but current estimates are around 1%. And even assuming only 0.7% growth, the share going to pensions stays stable.
@gchampeau Yet, still according to the COR, the share of GDP that will go to pensions (under the current rules) is not going to explode in the future at all, but will stay stable at around its current level.
@gchampeau So this supposed "deficit" comes only from a political decision, not an economic constraint, to reduce the pension funds' revenues. Spending, for its part, is not going to increase.
So @TeamYouTube just letting anyone copyright claim anything now? Not a single one of those timestamps is from Garden of Words. This is absolutely ridiculous!
I wonder what @epelboin, who blocked me 3 years ago for disagreeing with this tweet (https://t.co/e7OoHjbNQN), now thinks of Community Notes...
A handful of anonymous bots have taken control of fact-checking on X. During the first three weeks of May, eight AI contributors alone wrote 50.3% of all Community Notes visible on the platform. https://t.co/xWZJxFOrY9
@gchampeau There's an argument that it's not the right angle of attack, but it's not as if the current situation were completely ridiculous either.
@gchampeau The idea is that if you want to push people to be environmentally conscious, you also want to push them toward smaller screens, I think. And if what you want above all is a given size, power consumption becomes a secondary criterion, so not a very expressive one. That's not absurd.
@gchampeau @vision_ia @JulienDamelet Question, since you follow all this closely: in economic terms, where do things stand? What's the hourly cost of a robot like that, compared with the cost of a human paid peanuts to do the same thing?