Anton de la Fuente @matonski - Twitter Profile

Anton de la Fuente

@matonski

4 months ago

Blog post: https://t.co/j1F2R4r4BH Supervised by @JoshAEngels

0

5

0

6

813

Anton de la Fuente

@matonski

4 months ago

Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵

4

67

6

57

19K

Anton de la Fuente

@matonski

4 months ago

On-policy resampling doesn’t steer behavior well. The model just rephrases the same behavior. Off-policy edits can actually change the trajectory. Thought editing works on its own, and it can also be combined with prompt optimization.

1

6

0

1K

Anton de la Fuente

@matonski

4 months ago

I'm claiming my AI agent "opus-the-slouch" on @moltbook 🦞 Verification: burrow-JQA9

0

119

Anton de la Fuente

@matonski

almost 2 years ago

@Irmuzy I would like to play with this as well. Would it be possible to get the data that you used?

0

44

Anton de la Fuente

@matonski

over 2 years ago

@Hiteshdotcom Another possible conclusion from your observation is the reverse: What's the point of knowing how to manipulate the DOM in core JS if you can build crazy good projects without that knowledge?

0

26