Daniel Johnson @_ddjohnson - Twitter Profile

3 months ago

Code for our user modeling project is out now! https://t.co/F0NmdYhNVh This includes data generation, belief evaluation, and training code for our LatentQA decoders. We also uploaded our datasets and decoder checkpoints on Hugging Face: https://t.co/trUDGfDaME

0

50

7

22

7K

_ddjohnson retweeted

Transluce

@TransluceAI

4 months ago

Why does GPT-5.1 Codex score 6.5% worse than GPT-5 Codex on Terminal-Bench, with the same scaffold? 🧵 GPT-5.1 times out at ~2x the rate of GPT-5. Excluding timeouts, GPT-5.1 wins by 7.2%. We analyzed 256M+ tokens of traces and found this in under an hour. Here’s how 👇

2

74

15

19

10K

_ddjohnson retweeted

vincent!

@vvhuang_

6 months ago

We trained a decoder to read the internal activations of an LLM and answer questions about what the model will think about or do next. We find that this decoder can understand LLM behaviors, even when the model itself is confused! (for instance, if the model has been jailbroken) https://t.co/nhS0JxMHS8

9

106

27

23

21K

_ddjohnson retweeted

Transluce

@TransluceAI

6 months ago

Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.

2

165

33

66

36K

_ddjohnson retweeted

Transluce

@TransluceAI

6 months ago

Transluce is running our end-of-year fundraiser for 2025. This is our first public fundraiser since launching late last year. https://t.co/obs6LetVSX

4

96

22

9

64K

_ddjohnson retweeted

Dami Choi @damichoi95

6 months ago

Have you ever had ChatGPT give you personalized results out of nowhere that surprised you? Here, the model jumped straight to making recommendations in SF, even though I only asked for Korean food! https://t.co/7lOAYbt0Wm

1

48

18

5

7K

_ddjohnson retweeted

Transluce

@TransluceAI

6 months ago

Independent AI assessment is more important than ever. At #NeurIPS2025, Transluce will help launch the AI Evaluator Forum, a new coalition of leading independent AI research organizations working in the public interest. Come learn more on Thurs 12/4 👇 https://t.co/5Nzf9E2SPV

4

68

13

19

13K

_ddjohnson retweeted

Transluce

@TransluceAI

6 months ago

What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.

4

87

26

44

23K

_ddjohnson retweeted

Transluce

@TransluceAI

6 months ago

Transluce is headed to #NeurIPS2025! ✈️ Interested in understanding model behavior at scale? Join us for lunch on Thursday 12/4 to learn more about our work and meet members of the team: https://t.co/nOmFyTlsVs

1

78

8

33

25K

_ddjohnson retweeted

Anthropic

@AnthropicAI

6 months ago

Remarkably, prompts that gave the model permission to reward hack stopped the broader misalignment. This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.

37

1K

131

490

461K

_ddjohnson retweeted

Transluce

@TransluceAI

7 months ago

Transluce is partnering with @SWEbench to make their agent trajectories publicly available on Docent! You can now view transcripts via links on the SWE-bench leaderboard. https://t.co/GUQZflwcD1

3

43

14

8

8K

_ddjohnson retweeted

Transluce

@TransluceAI

7 months ago

Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow @belindazli, we find that they can, and that models explain themselves better than other models do. https://t.co/vQpTFJtNS7

5

272

57

195

68K

_ddjohnson retweeted

Transluce

@TransluceAI

7 months ago

We are excited to welcome Conrad Stosz to lead governance efforts at Transluce. Conrad previously led the US Center for AI Standards and Innovation, defining policies for the federal government’s high-risk AI uses. He brings a wealth of policy & standards expertise to the team. https://t.co/GQ1ng9HH77

1

28

9

2

4K

_ddjohnson retweeted

Shoalstone

@Shoalst0ne

8 months ago

If you're seriously trying to understand AGI, core concepts you should familiarize yourself with:

8

58

8

19

4K

_ddjohnson retweeted

Transluce

@TransluceAI

8 months ago

We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.

2

81

13

21

11K

_ddjohnson retweeted

Transluce

@TransluceAI

9 months ago

At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/) https://t.co/mJBUhcnbCd

4

240

38

129

41K

Daniel Johnson @_ddjohnson

9 months ago

@Mila_Quebec @hugo_larochelle Congratulations @hugo_larochelle!!

0

2

0

116

_ddjohnson retweeted

Transluce

@TransluceAI

9 months ago

Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code! https://t.co/ki6MMGH73j

7

206

36

102

35K

Daniel Johnson @_ddjohnson

10 months ago

@c_voelcker Congrats Dr. Claas!

0

1

0

100

_ddjohnson retweeted

Séb Krier

@sebkrier

11 months ago

When some people talk about future AIs, they sometimes jump straight to modelling them as fully independent and sovereign agents; new principals with their own objectives and values. They sometimes skip over how today's models actually work, on the grounds that eventually we’ll get those sovereign entities anyway, so we might as well reason from that endpoint. Fair enough, but once you take that shortcut you immediately face all the usual coordination and resource‑competition problems, because you’ve implicitly posited a second “species.”

It's an important frame to look into. But the trouble is that the crucial variable isn’t whether the entities are agentic and autonomous, but what characteristics you assume of them. In the sovereign‑agent frame the AI’s objective function is exogenous: it pursues its own ends. That assumption is doing almost all the work, and it’s arguably unwarranted. If instead you start from existing systems, you see that today’s AIs are delegated, prompt‑conditioned agents. They instantiate goals we hand them, modulated by policy overlays and market incentives, rather than waking up each morning with a personal life plan. Much more useful this way!

The “shoggoth behind the mask” meme captures the weirdness of the underlying models, and we should keep an eye on any latent drives and the differences between their cognition and ours. But so far the thing actually executed is still downstream of our instructions. You can imagine a future superintelligent system where you still say, “Build me a factory, but do it within these safety, cost, and emissions constraints,” and the agent’s entire long‑horizon plan remains conditional on that spec. It may spin up sub‑agents, collaborate, iterate, whatever, but the objective and sub-tasks it optimises are still anchored to your prompt plus the surrounding guardrails. You may not be good at specifying what you want, but that's a different issue.

That anchoring matters because it flips the strategic picture a bit: instead of planning for 'cohabitation' with alien intelligences (as we might with a population of aliens landing on earth), we plan for an ecosystem of powerful extensions of human intent; extensions that can, if we design them right, also mediate coordination among humans (and AIs). Modelling the future in this 'delegated‑agent frame' opens more design space: we can ask how to stabilise the control surfaces, aggregate conflicting human preferences (the normative part of the alignment problem), and build symbiotic governance structures, instead of assuming inevitable rivalry with a second species.

To be clear this is not a given or inevitable, and we still need a lot more work on alignment and the degree to which models robustly follow instructions. But even then I think it's more helpful to start with the assumption that they can be 'pretty aligned' rather than modelling them as a second species with a necessary inherent drive for 'survival' - hence why I'm so bullish on the cooperative AI agenda.

10

114

20

36

10K
