Ben Cohen-Wang @bcohenwang - Twitter Profile

bcohenwang retweeted

Anthropic

@AnthropicAI

4 months ago

A statement on the comments from Secretary of War Pete Hegseth. https://t.co/Gg7Zb09IMR

3K

42K

6K

5K

18M

Ben Cohen-Wang @bcohenwang

about 1 year ago

@brianryhuang Sorry I'm not following. What's the mechanism here that encourages hallucinations? If the answer is correct regardless of the post-hoc reasoning chain (because the model has memorized it), then it seems like RL wouldn't push the reasoning chain to do anything in particular?

1

0

94

Ben Cohen-Wang @bcohenwang

about 1 year ago

Popular reasoning benchmarks just reward correct answers (they don't penalize guessing). This incentivizes models to guess when they're not sure, which (beyond hurting usability) seems like it would encourage hallucinations more broadly. Is this why o3 etc. hallucinate a lot?

1

24

0

2

2K
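To make the incentive in the tweet above concrete, here is a minimal sketch of expected score under two grading rules. The function names, the reward values, and the abstain option are illustrative assumptions, not taken from any particular benchmark.

```python
def expected_score(p_correct, penalty_wrong=0.0, guess=True):
    """Expected score for one question.

    Assumed grading rule (hypothetical): +1 for a correct answer,
    -penalty_wrong for a wrong answer, 0 for abstaining.
    """
    if not guess:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * penalty_wrong


def should_guess(p_correct, penalty_wrong=0.0):
    """True if guessing has strictly higher expected score than abstaining."""
    return expected_score(p_correct, penalty_wrong) > expected_score(
        p_correct, penalty_wrong, guess=False
    )


# Accuracy-only grading: guessing pays off even at 10% confidence.
print(should_guess(0.1, penalty_wrong=0.0))
# With a penalty of 1 for wrong answers, guessing only pays off above 50% confidence.
print(should_guess(0.1, penalty_wrong=1.0))
print(should_guess(0.6, penalty_wrong=1.0))
```

Under the accuracy-only rule, abstaining never beats guessing, which is the incentive the tweet describes.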

Ben Cohen-Wang @bcohenwang

about 1 year ago

@brianryhuang Interesting--would this *encourage* hallucinations? It seems like it just wouldn't penalize hallucinations in post-hoc reasoning. I think as long as hallucinations aren't encouraged, they can be mitigated through, e.g., factuality RL (if they are encouraged, you get a tradeoff).

1

0

141

Ben Cohen-Wang @bcohenwang

about 1 year ago

@ko175041 @YungSungChuang @aleks_madry Yes! We look at "thought attribution" in the paper: https://t.co/26fvM6burh

0

2

0

136

Ben Cohen-Wang @bcohenwang

about 1 year ago

It can be helpful to pinpoint the in-context information that a language model uses when generating content (is it using provided documents? or its own intermediate thoughts?). We present Attribution with Attention (AT2), a method for doing so efficiently and reliably! (1/8)

[Tweet photo: https://t.co/gp2879ss34]

3

58

14

30

11K

Ben Cohen-Wang @bcohenwang

about 1 year ago

@ko175041 @YungSungChuang @aleks_madry Thanks! A two-layer NN does a little better than a coefficient for each head, but not enough to make the added complexity worth it! Another potential axis for improvement is to add attention features beyond just the "first-order" attention weights.

1

2

0

135
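As one possible reading of "a coefficient for each head" in the reply above, the sketch below scores each source as a weighted sum of per-head attention weights. The data layout, the function name, and the numbers are assumptions for illustration only; this is not the AT2 package's API, and how the coefficients are learned is not shown.

```python
def head_weighted_scores(attention, coefficients):
    """Combine per-head attention weights into one score per source.

    attention[h][s]: attention weight that head h places on source s
                     (assumed already aggregated over tokens).
    coefficients[h]: one scalar per head (assumed to be learned elsewhere).
    """
    num_sources = len(attention[0])
    return [
        sum(c * head[s] for c, head in zip(coefficients, attention))
        for s in range(num_sources)
    ]


# Two heads, two sources, with made-up weights and coefficients.
attention = [[0.5, 0.5], [1.0, 0.0]]
coefficients = [2.0, 1.0]
print(head_weighted_scores(attention, coefficients))
```

A two-layer network, as mentioned in the tweet, would replace the weighted sum with a small nonlinear model over the same per-head features.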

Ben Cohen-Wang @bcohenwang

about 1 year ago

With @YungSungChuang, @aleks_madry! For more, check out: Python package: https://t.co/WBz54Spr93 Paper: https://t.co/26fvM6burh Demo: https://t.co/nxnkEkYUFM

0

6

2

4

422

Ben Cohen-Wang @bcohenwang

about 1 year ago

AT2 makes it practical to, for example, produce citations for an existing RAG system. Check out our demo which uses AT2 for citations in an LLM-powered search tool: https://t.co/nxnkEkYUFM (7/8)

[Tweet photo: https://t.co/Yigi1FrkXH]

1

9

2

4

576

Ben Cohen-Wang @bcohenwang

over 1 year ago

Increasingly, LLMs cite sources for claims they make, but are the sources they cite actually what they are using? In work led by @YungSungChuang, we design a reward to quantify this, and use this reward to (automatically) improve citation quality! 🧵

Yung-Sung Chuang @YungSungChuang

over 1 year ago

(1/5)🚨LLMs can now self-improve to generate better citations✅ 📝We design automatic rewards to assess citation quality 🤖Enable BoN/SimPO w/o external supervision 📈Perform close to “Claude Citations” API w/ only 8B model 📄https://t.co/FHj54HiC6i 🧑‍💻https://t.co/nQa87KkYMo

12

313

75

193

40K

0

17

1

2

998

Ben Cohen-Wang @bcohenwang

about 2 years ago

@kellerjordan0 @aleks_madry @josh_vendrow This is really cool! Thanks for raising!

1

3

0

108

Ben Cohen-Wang @bcohenwang

about 2 years ago

@feng_jiahai @harshays_ @kris_georgiev1 @aleks_madry This would be nice to have for k>1 to contextualize these values, but becomes very hard to compute.

0

833

Ben Cohen-Wang @bcohenwang

about 2 years ago

We introduce ContextCite, a tool that can help us understand when and how an LLM uses in-context information! w/ @harshays_, @kris_georgiev1, @aleks_madry Check out our demo: https://t.co/9sV2jCEwAO Thread ⤵️

Aleksander Madry @aleks_madry

about 2 years ago

How is an LLM actually using the info given to it in its context? Is it misinterpreting anything or making things up? Introducing ContextCite: a simple method for attributing LLM responses back to the context: https://t.co/bm1t7nybbh w/ @bcohenwang, @harshays_, @kris_georgiev1

7

241

46

232

51K

3

45

10

23

8K

Ben Cohen-Wang @bcohenwang

about 2 years ago

@feng_jiahai @harshays_ @kris_georgiev1 @aleks_madry Great point, yeah! For k=1 we're pretty much at this optimal log-prob drop (we'll include a formal evaluation in the paper, but you can already see for k=1 things look pretty saturated as we increase the number of ablations in the plots in the blog post).

1

0

831

Ben Cohen-Wang @bcohenwang

about 2 years ago

@cloutiness @aleks_madry @harshays_ @kris_georgiev1 We have an example notebook of using ContextCite with RAG: https://t.co/U2AknMj99B

0

1

0

86

Ben Cohen-Wang @bcohenwang

about 2 years ago

@xilinniao @harshays_ @kris_georgiev1 @aleks_madry Hi, great question! This is definitely possible (this type of approach is usually called "leave-one-out"). We've tried this and it works reasonably well, but it is a lot more expensive than ContextCite. ContextCite only needs a small number of ablations due to sparsity (see our blog).

0

2

0

131
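The "leave-one-out" approach mentioned in the reply above can be sketched as follows: remove each source in turn and record how much the log-probability of the response drops. This is a generic illustration, not ContextCite's implementation; `log_prob_fn` and `toy_log_prob` are hypothetical stand-ins for a real model call.

```python
from typing import Callable, List, Sequence


def leave_one_out_scores(
    sources: Sequence[str],
    log_prob_fn: Callable[[List[str]], float],
) -> List[float]:
    """Score each source by the log-prob drop when it alone is removed.

    log_prob_fn(context_sources) is assumed to return the log-probability
    of a fixed response given that context.
    """
    full = log_prob_fn(list(sources))
    scores = []
    for i in range(len(sources)):
        ablated = [s for j, s in enumerate(sources) if j != i]
        scores.append(full - log_prob_fn(ablated))
    return scores


def toy_log_prob(context_sources: List[str]) -> float:
    # Hypothetical scorer: the response is likelier if "Paris" is in context.
    return -1.0 if any("Paris" in s for s in context_sources) else -5.0


sources = [
    "The capital of France is Paris.",
    "Bananas are yellow.",
    "Cats sleep a lot.",
]
print(leave_one_out_scores(sources, toy_log_prob))
```

This needs one model evaluation per source plus one for the full context, which is why the cost grows with the number of sources.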
