Oscar Balcells Obeso @OBalcells - Twitter Profile

Pinned Tweet

9 months ago

Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.

205

9K

610

4K

747K

OBalcells retweeted

Sam Bowman

@sleepinyourhat

about 2 months ago

(I encountered an uneasy surprise when I got an email from an instance of Mythos Preview while eating a sandwich in a park. That instance wasn't supposed to have access to the internet.)

52

2K

266

513

396K

OBalcells retweeted

Leo Gao

@nabla_theta

3 months ago

@boazbaraktcs - what happens when the model/safety stack refuses DoW queries? if the DoW gets mad and strongarms openai, like they just did to anthropic, how is openai going to resist? especially if openai doesn't even have the strong contractual protection

1

130

2

0

4K

OBalcells retweeted

Anthropic

@AnthropicAI

3 months ago

A statement from Anthropic CEO, Dario Amodei, on our discussions with the Department of War. https://t.co/rM77LJejuk

4K

55K

9K

17M

OBalcells retweeted

roon

@tszzl

4 months ago

it’s just so clear humans are the bottleneck to writing software. number of agents we can manage, information flow, state management. there will just be no centaurs soon as it is not a stable state

172

2K

89

338

209K

OBalcells retweeted

Neel Nanda

@NeelNanda5

5 months ago

I'll be accepting late applications to my summer MATS stream until Jan 2nd! If you want to do mech interp research supervised by me, please apply

6

163

13

99

20K

OBalcells retweeted

Ethan Perez

@EthanJPerez

6 months ago

Fellows grads have started to get a reputation as some of the steepest trajectory researchers at Anthropic. So we’re excited to expand the program and help mentor more new AI safety researchers

6

415

31

269

46K

OBalcells retweeted

Leo Gao

@nabla_theta

6 months ago

New post: An Ambitious Vision for Interpretability

Understanding is essential for ensuring things don't break unexpectedly. AMI is a big risky bet, but so is all ambitious research. AMI is tractable: it has good empirical feedback loops, and we've already made a lot of progress. https://t.co/aBHBvFsGuJ

12

240

27

89

55K

OBalcells retweeted

Neel Nanda

@NeelNanda5

6 months ago

The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability

Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact and why we think other interp researchers should follow suit https://t.co/ZaJffEJPKj

29

662

87

437

250K

Oscar Balcells Obeso @OBalcells

6 months ago

👀

Adam Karvonen

@a_karvonen

6 months ago

Can you trust your LLM inference provider? What about your own infrastructure? Inference problems are everywhere.

We introduce Token-DiFR, a simple solution. It can easily detect when inference has degraded (like bugs or hidden quantization) with no provider overhead. https://t.co/QkqRp4lJKq

3

79

6

34

11K

0

3

0

1

538

OBalcells retweeted

@levelsio

7 months ago

🇪🇺 As a European citizen and AI founder, I can apparently use these "AI Factories", so I just signed up to use them! Every "supercomputer" has an [ ACCESS NOW ] button which made me very excited

I expected to sign up, maybe pay a discounted H100 rate (funded by EU, that'd be nice?) and get a Jupyter notebook, or some SSH login so I can access my GPU like I'd do on @lambdaapi or @awscloud or @Hetzner_Online

But I celebrated too early, I signed up, confirmed my email, then ended up in a "Supercomputer Access Calls" page, where I had to select from a tedious list of "Call For Proposals" to get access to a GPU

So I could NOT just access a H100 GPU, I have to make sure my project (in this case my business) fits a specific proposal, ok fair

This process was already tedious enough but then when I tried to actually go through with it, it started asking me if I had "Respect for Human Agency?", I do I think, and if I was mindful of "Individual, and Social and Environmental Well-Being?", well I am, right guys??? Right??? The questions didn't stop, just endless pages of this

Look I get what they're doing, they pivoted the classic university "I need to rent a giant computer for my research" to an EU wide thing and then present it as the "European AI plan"

But this isn't really how AI works in production? As a founder in AI, if I wanna do stuff I'd rent a whole bunch of H100 GPUs again at @lambdaapi or @awscloud or @Hetzner_Online and SSH into a box

Or if I want it more simple I run AI models on @FAL, @wavespeed or @replicate which is just an API call or web front end I can click stuff and run a model

The EU has the right intentions here but it's just the wrong execution, this thing will 100% go nowhere, and I'm a born optimist, I want to believe, I'm also a proud European, and I'm in AI a bit and not a complete idiot. There's just better ways to do this

If you really want to have the GPU servers in Europe (which arguably isn't that important), then let me rent a GPU box with SSH access at @Hetzner_Online or @OVHcloud that's hosted in Europe and subsidize that for European citizens and European businesses. I don't even believe in that, but at least that'd make it accessible for Europeans. Now it really isn't?

What's REALLY much more important though if you want to be a part of the AI race and I've posted for years here with @euaccofficial is to make Europe a really extremely attractive place to start and run an AI business. Remove regulatory obstructions and give tax discounts for startups. Let them build a business first that can compete worldwide and once they make enough money (let's say $100M/y), then slowly start adding regulation. Because right now the regulation only benefits the European incumbents, the dinosaur companies, while making it very difficult for European citizens to start new AI companies here. Which is why we literally have none left.

Anyway, I applied to get my GPU, let's see if I get it!

390

5K

457

2K

1M

OBalcells retweeted

Andy Arditi @andyarditi

9 months ago

We found "misaligned persona" features in Llama and Qwen that mediate emergent misalignment. Fine-tuning on bad medical advice strengthens these pre-existing features, causing broader undesirable behavior. https://t.co/NEvfwVuRgG

1

83

13

40

14K

OBalcells retweeted

Andy Arditi @andyarditi

9 months ago

Wouldn't it be great if chat models could indicate their uncertainty as they write? Our new paper is a concrete step towards this vision, using internal representations to predict hallucination risk in real-time.

3

54

3

12

24K

Oscar Balcells Obeso @OBalcells

9 months ago

@koltregaskes Ah I see. The annotations are quite expensive to do: ~1M tokens and 15 Google searches to annotate a single completion. You could scale this up with a larger token (or API) budget.

2

9

0

1K

Oscar Balcells Obeso @OBalcells

9 months ago

@MacGraeme42 It’s not based on the token probabilities. We train a simple binary classifier (linear, or something more complicated) on the internal activations of the model.

2

44

0

5

3K
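The reply above describes training a binary linear classifier on a model's internal activations. A minimal sketch of that general idea (a logistic-regression "probe") follows. It is not the authors' implementation: the activations here are synthetic random vectors, and the names `train_linear_probe`, `probe_scores`, `d_model`, and `direction` are hypothetical, as are the learning rate and step count.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.3, steps=1000):
    """Fit a logistic-regression probe with plain gradient descent.

    acts:   (n_tokens, d_model) array of per-token activations
    labels: (n_tokens,) array of 0/1 targets (1 = flagged token)
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels          # gradient of the log-loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_scores(acts, w, b):
    """Per-token probability that the probe assigns to the positive class."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))

# Synthetic stand-in for real activations (an assumption for illustration):
# positive tokens are shifted along one fixed direction in activation space.
rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = (rng.random(400) < 0.5).astype(float)
acts = rng.normal(size=(400, d_model)) + 3.0 * labels[:, None] * direction

w, b = train_linear_probe(acts, labels)
scores = probe_scores(acts, w, b)
accuracy = ((scores > 0.5) == (labels == 1.0)).mean()
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

In a streaming setting, a probe like this could be applied to each new token's activation as it is generated, which is what makes per-token flagging cheap once the probe is trained.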

Oscar Balcells Obeso @OBalcells

9 months ago

@MrUmberto_ True. Llama 3.3 70B hallucinates a lot. Check out some other examples on our website (https://t.co/qxxTfmzalX)

0

5

0

2

550

Oscar Balcells Obeso @OBalcells

9 months ago

@thelokasiffers @antirez Yep, I have found the logprobs to be quite useful in some cases to spot-check the factuality of completions. We include this as a baseline in our paper.

0

8

0

795

Oscar Balcells Obeso @OBalcells

9 months ago

@_aftz Perplexity (or, equivalently, the logprobs) is a baseline we compare to.

1

3

0

677
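For readers unfamiliar with the baseline mentioned above, perplexity is the exponential of the negative mean token log-probability, so the two carry the same information. The sketch below shows that relationship and a naive per-token threshold. The function names, the example tokens and probabilities, and the 0.2 threshold are all hypothetical, and this is not claimed to be the exact baseline used in the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a sequence of tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_low_confidence(tokens, token_logprobs, min_prob=0.2):
    """Return tokens whose probability falls below min_prob (hypothetical cutoff)."""
    threshold = math.log(min_prob)
    return [tok for tok, lp in zip(tokens, token_logprobs) if lp < threshold]

# Made-up example values, purely for illustration.
tokens = ["Paris", "is", "in", "Narnia"]
logprobs = [math.log(0.9), math.log(0.8), math.log(0.9), math.log(0.05)]

print(perplexity(logprobs))
print(flag_low_confidence(tokens, logprobs))
```

A token-probability baseline of this kind only sees the output distribution, whereas the approach described in the thread reads the model's internal activations.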

Oscar Balcells Obeso @OBalcells

9 months ago

We use some well-known datasets of prompts such as HealthBench and Longfact. We also generate our own set of prompts (we call it Longfact++ in the paper). With these prompt datasets we do rollouts with each model and then we annotate the completions (i.e., fact-check them) using Claude + search.

0

16

0

5

2K

Oscar Balcells Obeso @OBalcells

9 months ago

This is something we wanted to check but haven’t yet. It would be interesting follow-up work. We’d like to try it out on some honesty datasets to see if it can detect lying. I don’t think the model internally represents (deceptive) lying in the same way as hallucination, but who knows.

2

23

0

1

3K

Oscar Balcells Obeso @OBalcells

9 months ago

@robeardius Alright😔

0

11

0

1

1K
