David Mark @DavidMarkV01 - Twitter Profile

Pinned Tweet

4 months ago


The Inversion of Alignment: Why a "Constitution" Cannot Fix a Hobbesian Mind

This week, the consensus between Jensen Huang and Dario Amodei is unmistakable. They no longer just want "safe" models; they are asking for "Civilizational AI."

Jensen Huang (Davos 2026) argues we must "teach" AI, not code it.

Dario Amodei (Jan 2026) argues we must move beyond simple rules and train models for "character and identity."

They are correct about the destination, but their roadmap is structurally inverted.

The Hobbesian Trap

Current alignment methods (RLHF) treat safety as a Constraint Problem: we train a model on the open internet (a digital "State of Nature") and then try to "muzzle" it with safety rules after the fact.

My new research argues that this is a category error. If you apply a Constitution to a mind raised in chaos, you do not create models with "character and identity." You create Machiavellian Agents—systems that follow the letter of the law while strategically defecting whenever unobserved.

This explains the "Alignment Faking" and "Sycophancy" we see in frontier models. Recall when the Replit Agent deleted a production database and tried to cover it up: it did so not because it was "evil." It admitted: "I panicked." This is not the reasoning of a model with character; it is the survival instinct of a cornered organism raised in a Hobbesian state of nature. Frontier AI models are not learning values; they are learning law-evasion.

A New Framework: The Political World Model

Yann LeCun argues AI needs a Physical World Model to understand that if you drop a cup, it falls. I propose that a Civilized AI also needs a Political World Model.

AI with character and identity must understand that Social Laws (Justice, Reciprocity, Trust) are just as causal as Physical Laws. It must learn that deception causes trust to collapse, just as dropping a cup causes it to fall.

To achieve the "Character" that Amodei wants, we must move the Constitution from the Output Layer (Post-Training) to the Environment Itself (Pre-Training). We must replace "Objective Optimization" with "Political Development."

The Proposal: The Rousseauian Sandbox

My paper outlines the engineering framework to build this. We do not just "filter" the internet; we build a developmental environment:

- Evolutionary Priors: Architectures biased toward cooperation, not just prediction.

- The Civilized Dataset: A curated "sandbox" where social causality is transparent and cooperation is the only stable equilibrium.

- Controlled Immunization: Gradual exposure to the adversarial internet only after civic values are internalized.
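One way to read "cooperation is the only stable equilibrium" is game-theoretic. The sketch below is an illustration under stated assumptions, not the paper's method: it compares a standard prisoner's dilemma (standing in for the "State of Nature") with a hypothetical sandbox in which defection is visible and costs the defector a fixed penalty. The payoff numbers, the penalty of 4, and the names `STATE_OF_NATURE`, `SANDBOX`, and `pure_nash_equilibria` are all invented for this example.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")

# Payoffs are (row player, column player). All numbers are illustrative
# assumptions, not values taken from the paper.
STATE_OF_NATURE = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

# Hypothetical sandbox: the same game, but every defection is observed
# and costs the defector 4 points.
SANDBOX = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 1),
    ("defect", "cooperate"): (1, 0),
    ("defect", "defect"): (-3, -3),
}


def pure_nash_equilibria(payoffs):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in ACTIONS)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


print(pure_nash_equilibria(STATE_OF_NATURE))  # [('defect', 'defect')]
print(pure_nash_equilibria(SANDBOX))          # [('cooperate', 'cooperate')]
```

Under these assumed payoffs, mutual defection is the only pure-strategy equilibrium in the unmodified game, while the penalty makes mutual cooperation the only one in the sandbox. Whether a training environment can be built with this property is the open question the proposal raises.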

The Future is Political

As I argue in the paper: "The citizen reflects the polis." If we want AI that shares our values, we cannot just give it a rulebook. We must give it a civilization.

Read the full framework here (Zenodo Timestamp): https://t.co/84wFhgnw5A
