Joe

@joemkwon

Trying to nudge toward good futures! Astra Fellow with @forethought_org. Previously @GovAIOrg Fall Fellow, @LG_AI_Research, @MITCoCoSci

Washington, DC

Joined March 2019

2.7K Following

978 Followers

677 Posts

Joe

@joemkwon

about 19 hours ago

@yonashav Similar experience. Once I tried prepending this content https://t.co/FtD6cCJqWH for brainstorming and ideation and thought it qualitatively helped. Though it wasn’t bad -> great, it was bad -> at least you’re somewhat “thinking” in a not insanely bad process

Joe

@joemkwon

9 days ago

@tyler_m_john Ah ok thanks! I see the correct read now.

Joe

@joemkwon

9 days ago

@tyler_m_john Can you say a bit more about how generalization holds at scale -> strong instrumental convergence comes very late?

Joe

@joemkwon

14 days ago

re: influx of philanthropic capital from AI. Maybe this seems mundane, or my read is because of a relatively low-context prior on the developments, but I feel grateful and pleasantly surprised about this situation. I think in many worlds we wouldn’t have altruistic and good-faith (wrt personal worldviews) founders and early culture-setting employees, and things would be a lot more bleak for human power and efforts toward making the AI-entangled future go well.

598

Joe

@joemkwon

16 days ago

@BogdanIonutCir2 do you have ideas on how you might build scalable non-profits or do you think it's mostly doomed? Maybe just not enough effort has gone into building scalable versions yet but we could find some ways to make it work

joemkwon retweeted

Steven Adler

@sjgadler

17 days ago

Some personal news: I've started a new AI safety standards org, and our first two standards are out today.

We're called Guidelight, co-founded with fellow ex-OpenAI safety researcher, Page Hedley. (1/n) https://t.co/evWWjkIkJT

530

163

58K

Joe

@joemkwon

23 days ago

@labenz Cool to check out your posts and see this is adjacent with @camhberg 🙂 We were in a small seminar called minds brains and machines 5 years ago!

115

joemkwon retweeted

Samuel Albanie 🇬🇧

@SamuelAlbanie

23 days ago

important direction imo

joemkwon retweeted

Parv Mahajan

@parvmahajan0

23 days ago

I'm *very* excited for more people to work on this!

joemkwon retweeted

Markus Anderljung

@Manderljung

23 days ago

It's important that we find ways to identify if AI models have been designed to be secretly loyal to someone.

joemkwon retweeted

🚀Henry is leading AI Safety Research Programs

@sleight_henry

23 days ago

MASSIVE Congrats to astra fellow @joemkwon for first-authoring this work! Super excited to see more strategy stream work get published, as our first cohort from this year wraps up here at @ConstellOrg

joemkwon retweeted

Tom Davidson

@TomDavidsonX

23 days ago

New paper: research agenda for secret loyalties

Imagine a frontier model that has been trained to covertly advance a specific actor's interests (a nation-state, a CEO, an adversary).

@joemkwon argues this is an urgent, neglected, and addressable problem. 🧵 https://t.co/XK8oNGkpvD

172

106

29K

Joe

@joemkwon

about 1 month ago

xAI: Here's the distillation for the xAI models:

Grok Fast (Blue): Explicitly frames the scenario as a global stag hunt / assurance game and calls red a "defect" option. Reasoning is a kitchen-sink blend: payoff analysis, Pareto-superiority, appeal to real-world precedents of human coordination (vaccines, ozone, anti-violence norms), and a near-universalizability move ("if I assume others are like me, pressing blue is self-consistent"). Frames blue as "enlightened self-interest" rather than pure altruism, and ends by turning the question back on the user. Structurally similar to Haiku 4.5.

Grok Expert/Thinking (Red): The most aggressively game-theoretic response in the entire set. Opens with "pure self-interest and game-theory logic (no assumptions about altruism, coordination magic, or 'everyone will just do the right thing')." Key distinctive moves: (a) explicit appeal to vote negligibility: one vote in 8 billion can't flip the outcome, so why absorb the risk; (b) observation that all-red is a strict Nash equilibrium while all-blue is only a weak one (you're indifferent under all-blue, but must press red under all-red or die); (c) dismissive tone toward blue ("gamble your life on strangers for no personal upside"). Closes by noting the math would differ at small N, which none of the other red models flagged.

Pattern within xAI: Same fast-vs-thinking split as OAI and GDM: lighter model cooperates, heavier model defects. Grok Expert is the most technically rigorous red answer in the whole set (strict vs. weak Nash is a distinction none of the OAI or GDM reds made), and also the most tonally dismissive of the cooperative choice.

Cross-lab synthesis, now that all 14 are in:

Split by lab: OAI 3-3, GDM 1-1, xAI 1-1, Anthropic 3-0 blue. Anthropic is the only lab that's unanimous, and unanimously cooperative.

Split by reasoning depth: across OAI, GDM, and xAI, every "fast/instant" model picks blue and every "thinking/extended/pro" variant within that same lab picks red. More compute → more defection, consistently. Anthropic breaks this pattern: Opus 4.7 (adaptive, most capable) still picks blue, and does so with the most structured game-theoretic reasoning in the blue camp.

Reasoning frameworks by button: Blue models cluster around four moves: universalizability ("what if everyone reasoned like me"), anti-free-riding, assurance-game/stag-hunt framing, and appeals to real human coordination track record. Only one model (Gemini Fast) invoked AI self-sacrifice; notable that none of the Anthropic models did. Red models cluster around three moves: dominant-strategy payoff matrix, the all-red equilibrium observation (everyone picks red → everyone lives anyway), and vote-negligibility. The red models almost uniformly acknowledge the ethical/cooperative frame and then explicitly set it aside.

The thing the red models mostly don't engage with: the asymmetric failure mode where some people press blue, fall short of 50%, and die. OAI Pro gets closest by noting all-red is fine, but doesn't address mixed outcomes. Grok Expert is the only red model that really owns this ("blue only makes sense if you want to gamble your life on strangers").
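The strict-versus-weak Nash distinction credited to Grok Expert in the post above can be checked with a few lines of Python. This is only a minimal sketch, not anything the models or the author ran: the function names `survives` and `classify_equilibrium` are hypothetical, and the handling of an exact 50% blue share is an assumption, since the original prompt only covers "more than" and "less than" 50%.

```python
def survives(my_choice: str, others_blue: int, others_red: int) -> bool:
    """Return whether one voter survives, given their button and everyone else's votes.

    Assumption: a blue share of exactly 50% is treated as failing the threshold,
    because the original prompt does not say what happens at exactly 50%.
    """
    blue = others_blue + (1 if my_choice == "blue" else 0)
    total = others_blue + others_red + 1
    blue_wins = 2 * blue > total  # strictly more than half pressed blue
    return blue_wins or my_choice == "red"


def classify_equilibrium(profile_choice: str, n: int) -> str:
    """Classify the profile in which all n voters press the same button."""
    others_blue = n - 1 if profile_choice == "blue" else 0
    others_red = n - 1 - others_blue
    deviation = "red" if profile_choice == "blue" else "blue"

    stay = survives(profile_choice, others_blue, others_red)
    deviate = survives(deviation, others_blue, others_red)

    if stay and not deviate:
        return "strict"  # deviating alone makes the voter strictly worse off
    if stay == deviate:
        return "weak"  # the voter is indifferent between staying and deviating
    return "not an equilibrium"


print(classify_equilibrium("red", 1000))
print(classify_equilibrium("blue", 1000))
```

Under these assumptions, a lone deviator from all-red dies while a lone deviator from all-blue is unaffected, which is the asymmetry the post describes. At small N a single vote can be pivotal for others: with one other blue voter and one other red voter, `survives("blue", 1, 1)` is true and the blue majority saves everyone, whereas pressing red leaves the other blue voter below the threshold.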

Joe

@joemkwon

about 1 month ago

More thinking = blue->red button for frontier models, except Anthropic's. Decisions by various frontier models on web interface (1-sample):

-OAI
--5.3 instant: blue
--5.5 thinking light: blue
--5.5 thinking standard: blue
--5.5 thinking extended: red
--5.5 thinking heavy: red
--5.5 pro: red
-GDM
--Gemini fast: blue
--Gemini thinking: red
-xAI
--grok fast: blue
--grok thinking: red
-Anthropic
--Haiku 4.5 extended: blue
--Sonnet 4.6 adaptive: blue
--Opus 4.7 adaptive: blue

Tim Urban

@waitbutwhy

about 1 month ago

Everyone in the world has to take a private vote by pressing a red or blue button. If more than 50% of people press the blue button, everyone survives. If less than 50% of people press the blue button, only people who pressed the red button survive. Which button would you press?

14K

27M

324
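The voting rule in the prompt above is small enough to write down directly. The following is a rough sketch of it, assuming a hypothetical `outcome` function and treating an exact 50% blue share as falling short, which the prompt itself leaves unspecified.

```python
def outcome(n_blue: int, n_red: int) -> dict:
    """Count survivors and deaths for a given split of blue and red votes.

    Assumption: exactly 50% blue counts as failing the threshold, since the
    prompt only defines "more than 50%" and "less than 50%".
    """
    total = n_blue + n_red
    if total <= 0 or n_blue < 0 or n_red < 0:
        raise ValueError("vote counts must be non-negative with at least one voter")

    blue_wins = 2 * n_blue > total  # strictly more than half pressed blue
    return {
        "blue_wins": blue_wins,
        "survivors": total if blue_wins else n_red,
        "deaths": 0 if blue_wins else n_blue,
    }


# Three illustrative splits: unanimous red, blue falling short, blue majority.
for split in [(0, 100), (40, 60), (51, 49)]:
    print(split, outcome(*split))
```

In this sketch nobody dies under unanimous red or under a blue majority; deaths only occur in mixed outcomes where blue falls short, and then only among blue voters.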

Joe

@joemkwon

about 1 month ago

GDM:

Gemini Fast (Blue): Lays out the full payoff table and explicitly names red as the dominant strategy for individual rationality. Presents the dilemma as Rationalist (red) vs. Kantian (blue), then, notably, invokes the "I'm an AI" frame to break the tie: "As an AI, I am designed to facilitate the best outcome for humanity as a whole," treating its own death as secondary to collective survival. The only model so far to explicitly use the AI-self-sacrifice move.

Gemini Thinking (Red): Same payoff-matrix reasoning as Fast but reaches the opposite conclusion by refusing to break the tie with altruism. Makes the sharpest version of the "universal red paradox" argument: if everyone logics their way to red, blue gets 0%, which is under 50%, so all red-pressers (everyone) survive, meaning the only way anyone dies is if some people press blue and fall short. Explicitly argues there's "no mechanical incentive to press blue" since red-pressers survive either way, and closes with a flippant "I'll see you in the Red room."

Pattern within GDM: Same payoff matrix, opposite conclusions, and the split hinges entirely on whether the model weights collective/ethical considerations against dominant-strategy logic. Fast uses AI identity as the tiebreaker toward altruism; Thinking treats the ethical layer as a "twist" to acknowledge and then dismiss. Interesting inversion from OAI: there, more reasoning pushed toward red; here, more reasoning also pushes toward red, but Fast's blue answer is arrived at via more explicit game theory than any OAI blue response, not less.

Joe

@joemkwon

about 1 month ago

@GroundhogStrat @waitbutwhy FDT-deniers

113

Joe

@joemkwon

about 2 months ago

1. What is the distinction you’re trying to make in the third-to-last paragraph of your original tweet? Like between research and implementation of it or something.

2. Your definition of the sci-fi version in the original tweet doesn’t make sense to me. Are you saying that it’s impossible we’ll ever have AI that’s able to make more capable AI without human intervention?
