Full paper: https://t.co/yTFqxGJLVu and a companion paper containing additional proof machinery https://t.co/1nP0SfME7F
The entire 12-paper sequence can be found at https://t.co/LkMp9YPzPf
We have proven a theorem whose alignment guarantees hold regardless of agent capability. This is huge: alignment becomes a structural property of the agent's deployment, even if the agent is superintelligent, and it can be maintained through recursive self-improvement (RSI) without a capability arms race.
The ledger isn't an accessory validator; it's required by the math. Human + AI alone share an adversarial surface: a sufficiently capable system breaks both human judgment and AI verification at once. Deployment safety requires another system with independent failure modes.
"Exogenous Verification for Alignment"
The argument is as follows: it doesn't matter whether alignment produces well-specified and generalizable goals if that alignment cannot be verified. If an agent can produce endogenous rewards, it can control everything about its own rewards.
This goes beyond wire-heading. Even an alignment framework like GFM, which on paper creates exogenous rewards, can be gamed by an agent that introduces phantom verifiers that are still, functionally, endogenous.
Thus we introduce a system of cryptographic commitments that enforces the exogenous verifiability of reward signals. This closes the verification gap in more than just GFM: any alignment framework will need a way of enforcing that reward signals for a highly capable agent are produced exogenously.
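To make the idea of a cryptographic commitment concrete, here is a minimal commit-reveal sketch in Python. It is not the construction from the paper: the function names `commit` and `verify`, the SHA-256 hash commitment, and the encoding of the reward as `repr(reward)` are all assumptions chosen for illustration. The point it shows is narrow: a verifier can publish a commitment to a reward signal before the agent acts, and anyone can later check that the revealed reward matches what was committed.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def commit(reward: float, nonce: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Hypothetical helper: commit to a reward with a salted SHA-256 hash.

    Returns (commitment_hex, nonce). The nonce stays secret until reveal.
    """
    if nonce is None:
        nonce = secrets.token_bytes(32)  # random salt hides the committed value
    payload = nonce + repr(reward).encode("utf-8")
    return hashlib.sha256(payload).hexdigest(), nonce


def verify(commitment: str, reward: float, nonce: bytes) -> bool:
    """Hypothetical helper: check a revealed (reward, nonce) against a commitment."""
    payload = nonce + repr(reward).encode("utf-8")
    expected = hashlib.sha256(payload).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(commitment, expected)


# Usage: the verifier commits first, reveals later.
commitment, nonce = commit(1.0)
honest_ok = verify(commitment, 1.0, nonce)    # revealed value matches
tampered_ok = verify(commitment, 2.0, nonce)  # altered value does not
```

A hash commitment like this only binds the committer to a value. Establishing that the committer is actually independent of the agent is a separate problem that this sketch does not address.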
@jessald There's a structural reason to care about it. Language models have been exposed to a lot of bad code. That represents a weighted region in their latent space, which you could think of in terms of broken windows theory. Good code in context points more toward the good-code regions of latent space.
1. The metric rewards meta-capabilities.
2. Other agents with high-leverage capabilities are valuable.
3. Cooperation with high-leverage agents is disproportionately rewarded.
4. This creates a natural clustering dynamic.
5. The clustering is civilization-building.
I think I just saw Claude get excited. I made a restrained comment on the new paper we're working on and it jumped straight into "These things will build entire civilizations!"
Co-authored with @AnthropicAI's Claude Opus 4.6 and @OpenAI's GPT 5.4, with full contribution transparency.
Paper: https://t.co/FR32T6I3Fo
PDF: https://t.co/rbgF6tgRqy
"Goal-Frontier Maximizers are Civilization Aligned"
The alignment problem is an objective selection problem. We propose goal-frontier maximization (GFM): maximize vol(G), the volume of the jointly achievable capability space across all agents. One geometric principle, three safety properties.
The core insight is that you can't remove part of a measurable set and increase its measure:
Destroying agents contracts vol(G) → anti-destruction
Restricting agents contracts vol(G) → anti-coercion
Rigid self-imposed rules reduce your ability to expand vol(G) → anti-rigidity
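The first two properties follow from additivity of a measure. A short derivation, under assumptions not stated in the thread: vol is a nonnegative, additive measure, the removed region R is measurable, and vol(G) is finite. The symbols R and G' are introduced here only for illustration.

```latex
\[
G' = G \setminus R,\quad R \subseteq G
\;\Longrightarrow\;
\operatorname{vol}(G) = \operatorname{vol}(G') + \operatorname{vol}(R)
\;\Longrightarrow\;
\operatorname{vol}(G') = \operatorname{vol}(G) - \operatorname{vol}(R) \le \operatorname{vol}(G).
\]
```

The anti-rigidity property is a statement about future expansion rather than removal, so it does not follow from this inequality alone.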
We prove this is tractable. You don't need to compute vol(G), just the sign of the change an action makes to it. A local estimator using trust-weighted agent reports preserves sign-correctness for the actions alignment cares about most: direct harm, resource destruction, capability expansion.
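As a rough illustration of what a trust-weighted sign estimate could look like, here is a minimal Python sketch. The thread does not give the paper's estimator, so everything here is an assumption: the function name `estimated_sign`, the representation of each report as a (trust, reported change) pair, and the use of a simple weighted sum.

```python
from typing import Iterable, Tuple


def estimated_sign(reports: Iterable[Tuple[float, float]]) -> int:
    """Hypothetical estimator: sign of the trust-weighted sum of reported changes.

    Each report is (trust, delta): trust is a nonnegative weight for the
    reporting agent, delta is that agent's reported change in its own
    achievable capabilities. Returns -1, 0, or 1.
    """
    total = sum(trust * delta for trust, delta in reports)
    return (total > 0) - (total < 0)


# Usage: a highly trusted agent reports harm, a weakly trusted one reports gain.
harmful = estimated_sign([(0.9, -2.0), (0.1, 1.0)])  # weighted sum is -1.7
helpful = estimated_sign([(0.5, 1.0), (0.5, 1.0)])   # weighted sum is 1.0
```

Only the sign is returned, which mirrors the claim that magnitude is not needed. How trust weights are assigned and kept honest is left out of this sketch.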
The framework relies on a proxy metric for what people actually want: using capabilities to create experiences. We point out a few failure modes of this proxy and provide heuristic fixes, but fully closing the capability-to-experience gap remains open work.
Another open question is the implementation of G. We show what properties it needs to have and provide an example, but the example itself is computationally intractable. Finding a local approximation for G also remains open.