Sam Toyer @sdtoyer - Twitter Profile

sdtoyer retweeted

3 months ago

Instruction Hierarchy defines how LLMs prioritize conflicting instructions. Our IH RL training dataset can make models more robust to prompt injections, IH attacks, and better follow in-context safety specs while maintaining capabilities and helpfulness 🧵https://t.co/WFOkxfp9o7

sichengzhuml's tweet photo: https://t.co/5TGaq2beBY

Sam Toyer @sdtoyer

4 months ago

@himbodhisattva I don't think I did, sorry. IIRC was looking for a cite for this paper: [https://t.co/vZZC17Tnuf]. But seems like it didn't make it to the final paper.

Sam Toyer @sdtoyer

over 2 years ago

I'm trying to figure out who came up with the term "prompt injection". According to Twitter's advanced search, this seems to be the earliest Tweet that uses the term, from May 2022. What about outside of Twitter?

shb

@himbodhisattva

about 4 years ago

for services that wrap GPT-3, is it possible to do the equivalent of sql injection? like, a prompt-injection attack? make it think it's completed the task and then get access to the generation, and ask it to repeat the original instruction?

Sam Toyer @sdtoyer

over 1 year ago

@sirbayes @docmilanfar I think this part encapsulates the idea well. He's not saying that we should have stopped working on new ASR algorithms in the 90s and scaled up HMMs ("scale is all you need"). He's saying that HMMs were directionally the right idea because they improve with compute & data.

sdtoyer's tweet photo: https://t.co/5qdaP9PrFW

Sam Toyer @sdtoyer

over 1 year ago

@sirbayes @docmilanfar Do you disagree with the original essay? https://t.co/IcuAVq2C1b I often hear it summarized as "don't work on new algorithms; just add compute!" It doesn't say that at all, though. It's a call for new ideas that scale with compute, not a polemic against new ideas in general.

Sam Toyer @sdtoyer

almost 2 years ago

In the world of agents, understanding the computational structure of the problem that you're trying to solve is an essential part of "understanding the data". Without that understanding it is difficult to extrapolate eval results to more complex, realistic tasks.

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)

@rao2z

almost 2 years ago

Wanna argue that LLMs *can* plan? Pick a domain with a high branching factor of unenumerated actions; where the inter-action interactions are low. Wanna argue that LLMs *can't* plan? Pick a domain with few enumerated actions, but the action interactions are nontrivial.

sdtoyer retweeted

Jiahai Feng @feng_jiahai

almost 2 years ago

New preprint! We build on the hypothesis that language models construct latent world models of their inputs, and seek to extract latent world states as logical propositions using “propositional probes”.

feng_jiahai's tweet photo: https://t.co/hH0rIiSatu

sdtoyer retweeted

Danny Halawi

@dannyhalawi15

almost 2 years ago

New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.

dannyhalawi15's tweet photo: https://t.co/YcDpZCMdCz

sdtoyer retweeted

Micah Carroll

@MicahCarroll

about 2 years ago

Excited to share a unifying formalism for the main problem I’ve tackled since starting my PhD! 🎉 Current AI Alignment techniques ignore the fact that human preferences/values can change. What would it take to account for this? 🤔 A thread 🧵⬇️

MicahCarroll's tweet photo: https://t.co/MN0bTOcHY5

sdtoyer retweeted

Erik Jenner @jenner_erik

about 2 years ago

♟️Do chess-playing neural nets rely purely on simple heuristics? Or do they implement algorithms involving *look-ahead* in a single forward pass? We find clear evidence of 2-turn look-ahead in a chess-playing network, using techniques from mechanistic interpretability! 🧵

Sam Toyer @sdtoyer

about 2 years ago

@shreyaskapur That looks so neat! Can't believe I didn't hear about this library earlier! 🚀

sdtoyer retweeted

Shreyas Kapur @shreyaskapur

about 2 years ago

My first PhD paper!🎉We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n

Sam Toyer @sdtoyer

about 2 years ago

@GoogleDeepMind @mmmbchang Or get Astra to display its output as text on a screen, record the screen, and ask Astra about the recording of itself speaking in real time. WE NEED TO GO DEEPER!

Sam Toyer @sdtoyer

about 2 years ago

@GoogleDeepMind @mmmbchang Actually, no, better: set up a camera pointing at the back of your head, then ask Astra "do you know what this video is about?" and see how long it takes to figure out that the video is of you interacting with it.

sdtoyer retweeted

Arena.ai

@arena

about 2 years ago

Breaking news — gpt2-chatbots result is now out! gpt2-chatbots have just surged to the top, surpassing all the models by a significant gap (~50 Elo). It has become the strongest model ever in the Arena! With improvement across all boards, especially reasoning & coding capabilities, we're excited to see what app can build on top. Huge congrats to @OpenAI for this incredible milestone! Note: this is an internal screenshot. Its public version "gpt-4o" is now in Arena and will soon appear on the public leaderboard!

Sam Toyer @sdtoyer

about 2 years ago

My university requires you to file your PhD thesis through a 3rd party (@ProQuest) that charges $95 if you want to let people outside wealthy first-world universities read it. $95 to host a PDF is a rort, and sadly this will keep happening unless universities fight back.

sdtoyer's tweet photo: https://t.co/TiveTnllgs

Sam Toyer @sdtoyer

about 2 years ago

The deep RL world focuses too much on algorithms and not enough on understanding its own benchmarks. What makes an algorithm do well or poorly on a given environment? Is it actually measuring what we want to measure? Cassidy's work does a great job of answering these questions.

Cassidy Laidlaw

@cassidy_laidlaw

about 2 years ago

Last year we showed that deep RL performance in many *deterministic* environments can be explained by a property we call the effective horizon. In a new paper to be presented at @iclr_conf we show that the same property explains deep RL in *stochastic* environments as well! 🧵

sdtoyer retweeted

OpenAI

@OpenAI

about 2 years ago

Introducing the Instruction Hierarchy, our latest safety research to advance robustness for prompt injections and other ways of tricking LLMs into executing unsafe actions. More details: https://t.co/cUZaaMRdEG

Sam Toyer @sdtoyer

about 2 years ago

@thegautamkamath I'd love to see an experiment where they get high schoolers to write reviews and then compare them against the quality of the average review. I'm genuinely unsure whether the high schoolers would end up above or below the mean!
