Phillip Guo @phuguo - Twitter Profile

Phillip Guo

@phuguo

about 6 hours ago

@deanwball @OpenAI Congrats, welcome!!

0

65

Phillip Guo

@phuguo

5 days ago

@banteg @burny_tech SWE-Bench: still somehow 75%

0

1

0

86

Phillip Guo

@phuguo

about 1 month ago

My teammates did some super cool work on auditing/overseeing OpenAI RL runs to catch alignment-relevant process mistakes, then studying the effects of those mistakes!

Micah Carroll

@MicahCarroll

about 1 month ago

We recently found some instances of CoT grading during the training of previously deployed models after building a system that scans all OpenAI RL runs for accidental CoT grading. We did not find clear evidence that these instances degraded CoT monitorability.

https://t.co/GB1QeaeZ8A

30

474

50

174

221K

0

10

0

709

Phillip Guo

@phuguo

about 2 months ago

@_NathanCalvin @TaylorLorenz (for anyone else who sees this, I had a response here! https://t.co/s5mwVN610E)

Phillip Guo

@phuguo

about 2 months ago

The blog post moves us closer to a world where we:
1. notice these pathologies early, ideally before deployment (see https://t.co/Cvc5nRYvcx)
2. trace them back to unintended/broken data or reward signals (as in this case, we'd already deprecated the nerdy personality feature)

I don't think it's necessary to predict weird pathologies before even training models, as long as we get better at catching them + addressing them at a deeper level with win-win fixes.

0

11

0

1

309

0

2

0

119

i don't post much of substance here anymore, find me at https://t.co/BfdBjQy1e5

Phillip Guo

@phuguo

about 2 months ago

Codex and I helped root cause goblins! We traced it to a reward signal intended to train the "Nerdy" personality - we found that it scored outputs with goblins higher, and as it boosted goblins in Nerdy training, the behavior generalized. See the blog post!

OpenAI

@OpenAI

about 2 months ago

We’re talking about Goblins. https://t.co/dqmcLGCW71

525

8K

833

2K

2M

25

355

19

58

33K

Phillip Guo

@phuguo

about 2 months ago

The blog post moves us closer to a world where we:
1. notice these pathologies early, ideally before deployment (see https://t.co/Cvc5nRYvcx)
2. trace them back to unintended/broken data or reward signals (as in this case, we'd already deprecated the nerdy personality feature)

I don't think it's necessary to predict weird pathologies before even training models, as long as we get better at catching them + addressing them at a deeper level with win-win fixes.

0

11

0

1

309

Phillip Guo

@phuguo

about 2 months ago

@Laurentia___ Was only a small part - big ty to you and everyone else on Codex + PT + personality + data science + comms!

0

1

0

279

Phillip Guo

@phuguo

about 2 months ago

If you overlay this plot with the "Training conversations WITH the Nerdy personality" plot and rescale the y axes, you'll see that the changes in prevalence basically perfectly overlap. This suggests that whenever the model learns to say goblins more with the Nerdy personality prompt, the behavior generalizes to when the model doesn't have the personality

2

8

0

375

phuguo retweeted

Rae Lasko @raelasko

about 2 months ago

my job is weird sometimes & i love it https://t.co/wrtPpDlkgl

6

65

1

5

3K

Phillip Guo

@phuguo

about 2 months ago

The first part of this investigation was 95% codex - it probably sped up the initial investigation by at least 5x, and turned it into a little one day side project goblin with little mental overhead. We're excited to apply this approach to other alignment problems!

1

46

1

4

2K

phuguo retweeted

Marcus Williams @Marcus_J_W

about 2 months ago

Excited that we extend pre-deployment resampling evals to internal coding agent traffic for the GPT-5.5 system card. We take transcripts from our internal coding traffic and resample the last turn with GPT-5.5. Simulating tool outputs with another LLM works surprisingly well. https://t.co/b2BfkkKr3U

3

38

7

12

9K

phuguo retweeted

Abhay Sheshadri @abhayesian

3 months ago

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

https://t.co/JNShb62b8y

12

265

39

178

29K

phuguo retweeted

Miles Wang

@MilesKWang

6 months ago

New @OpenAI research: How can we scale supervision of increasingly capable models? Can we rely on monitoring GPT-7's chain-of-thought? We develop a new metric for monitorability and study its scaling trends, coming away with cautious optimism. 🧵:

https://t.co/ChzQNoXxE0

14

316

34

143

26K

phuguo retweeted

Jasmine Wang @j_asminewang

7 months ago

Today, OpenAI is launching a new Alignment Research blog: a space for publishing more of our work on alignment and safety more frequently, and for a technical audience. https://t.co/n3oIhyDZHd

40

1K

136

480

469K

phuguo retweeted

Tejal Patwardhan @tejalpatwardhan

9 months ago

Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.

https://t.co/YsQvmdGK94

58

1K

182

737

1M

phuguo retweeted

Reve @reve

9 months ago

Reimagine reality. https://t.co/EkfxuKirKC

259

4K

375

3K

1M

phuguo retweeted

Bogdan Ionut Cirstea @BogdanIonutCir2

12 months ago

I would be excited to see 'Why Do Some Language Models Fake Alignment While Others Don't?' get at least as much publicity and attention as 'Alignment faking in LLMs', since the findings seem comparatively interesting, and potentially more impactful in terms of mitigations.

4

41

3

9

3K

phuguo retweeted

Miles Wang

@MilesKWang

about 1 year ago

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵: https://t.co/BW6YCnf3oE

216

2K

346

779

867K

Phillip Guo

@phuguo

about 1 year ago

@nnepetalactone congrats!!

0

1

0

74
