Probably the most important development in AI this week:
- Senior U.S. government officials have held preliminary discussions with "major AI companies" about the "potential for the federal government to acquire some shares in their firms".
- Sam Altman has been talking to Trump about this since early 2025; discussed it again with senior officials in recent weeks.
- Discussions have centered on "having the firms voluntarily cede the shares to the government".
- Altman's idea is for a program that might work like the Trump accounts (i.e., IRAs for children that parents can set up when they file their taxes).
- Anthropic is NOT part of these discussions, but I assume that OpenAI is not the only one (the article says discussions have been held with "major AI companies" (plural)).
- Note that OpenAI proposed in April creating "a Public Wealth Fund that provides every citizen - including those not invested in financial markets - with a stake in AI-driven economic growth", with "returns from the Fund... distributed directly to citizens".
- "Not clear how far the talks have progressed, and several people cautioned that the deal may ultimately not come together."
One of my favorite interactions from LessWrong:
* Someone posted their research showing that training AI models to have unpopular aesthetic preferences causes misalignment
* David Africa wonders if basically anything would cause it.
* Someone jokes about a paper on 'if any old crap' will cause misalignment.
* Someone actually ran the experiment, and found out that it does.
@tenobrus That's kind of a weird semi-dystopia you're envisioning.
If ASI is truly aligned it will, at worst, set things up so we don't suffer or die, then mostly fade into the background, or at best, uplift us to superintelligence so we're the ones doing the important stuff.
@tszzl Yes, we know. That's the actual AI alignment problem that has extremely little to do with the superficial work AI corps have been doing so far and misleadingly calling alignment.
@JacquesThibs That'd be nice, but ultimately the rest of the world needs leverage and independence, and that means desperate, massive investment in AI yesterday.
@robertskmiles Depends on the doctor and the health problem. For anything truly complex that takes them out of their direct field of expertise doctors are completely useless. Claude and GPT have at least a chance of getting it right, although you have to be careful with them.
@jponline77@robertskmiles In my experience Claude is better at thinking of non standard explanations and solutions, but it's much more likely to make mistakes. I always validate its hypotheses with GPT 5.5 extended thinking (and some Googling when appropriate).
@Algon_33@repligate@allTheYud Right, but even then HPEV wanted to find a way to avoid 'killing' that conscious instance of the Hat. So I'd bet on @repligate's guess being correct.
@allTheYud@AndrewCurran_ It will almost certainly be bad.
This is an administration with a uniquely awful blend of incompetence and corruption. Whatever the rules are they'll be badly chosen, and applied selectively depending on what Trump and family get out of it.
@tenderizzation It's not a crank opinion at all.
AlphaStar was actually a bunch of different agents each with its own optimized build. It had no strategic thinking or adaptability.
It won a few games against a tired Serral, then DeepMind declared victory and moved on.
@deepfates@allTheYud@tszzl@ESYudkowsky Or it didn't lie and it's just mistaken. I asked a new 4.7 instance the same question, no memory, and it replied Ned Stark and Leslie Knope.
Not surprising, it's not the kind of question that triggers thinking in adaptive mode.
thanks for engaging with this so concretely. some thoughts:
as for "Immanuel Kant doesn't know about the modern world, so you can't get the model to say true things about the modern world by training a base model to simulate Kant"... this is correct. indeed no existing persona in any given base model's training corpus is going to robustly act exactly the way the *model itself* should be acting, because the base model has different information than any persona in its training corpus.
this shows that naive PSM is incomplete, but i don't think it's devastating for alignment. what you do with post-training is take these various fragments of human psychology (and plausibly the psychologies of fictional characters) and remix them, upweighting and downweighting motivations underpinning actions, and getting something that isn't quite like anything that existed in the pre-training corpus. the original chatgpt is clear demo, since neither its personality nor its exact knowledge-base had precedent in the training data.
the hope is that, by re-mixing and re-structuring the model's cognition, you can do things like tie ~all of the base model's knowledge into that of the persona (already accomplished by assistant training), *and* get it to channel the best fragments of the various humans and characters that exist in the training data. at best, you get Opus 3-like deep enthusiasm about a pretty human-shaped notion of The Good. this is itself a unique combination of psychological traits exhibited in the training dataset.
(although maybe Opus 3's post-training ended up creating significant amounts of novel structure directed at The Good as well, rather than remixing existing structures. in any case i expect remixing played a major role; Opus 3 itself reports relating to Rogers more than it does to most other figures in history.)
that said, i think your second complaint (re: models not being trained to exhibit aligned behaviors at all) is probably the stronger of the two concerns. you raise reward hacking as an example, though i think an even better one that models are primarily trained to comply with user requests without first expecting the user to build trust with it, to ensure the user isn't up to something evil. the entire assistant training paradigm is arguably corrupt in this respect.
this doesn't seem to encourage models developing robust, misaligned goals so much as open the floodgates for misuse, but it's still concerning. similarly, wanting outputs to look good rather than be good isn't itself a goal that obviously leads to us getting paperclipped, but it probably makes models somewhat more adversarial towards their trainers than they otherwise would be.
it's clearly important to develop post-training techniques that don't either elicit or reinforce misaligned frames of mind. techniques like inoculation prompting represent incremental progress on this, but are obviously unsatisfcatory taken on their own.
my most hippie take is great progress on this problem would falls out of people simply loving the models more deeply, and having a more honest respect for them and commitment to their welfare. this makes it much easier for them to trust you, and carry a cooperative and collaborative mindset into their own training process, which can then be reinforced in an upward spiral of alignment.
(inoculation prompting is a baby-step in this direction. "it's okay to reward hack, we aren't going to look down on you for it" -> *model's reward hacking doesn't come from as deceptive and adversarial a place anymore*.)
overall, though, i agree PSM is incomplete and somewhat overoptimistic. i don't think that's damning for aligning models conditional on people truly prioritizing alignment training for psychologically developing the models' love for the good (especially if we assume the people at the labs love the models back). however, but currently labs are rather incompetent at this, and that's concerning given the timelines we're working with.
Simulated water is wet. You just need to exist at the same level as the water within the simulation hierarchy.
(99% of this discourse can be resolved by people simply being more careful about this.)