Iván Arcuschin

@IvanArcus

Independent Researcher | AI Safety & Software Engineering

Argentina

Joined March 2011

225 Following

1.4K Followers

108 Posts

Pinned Tweet

Iván Arcuschin @IvanArcus

4 months ago

You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13

https://t.co/hgIZIReYb9

236

12K

2K

5K

874K
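The one-word swap described in the pinned thread can be illustrated with a minimal sketch. This is not the authors' pipeline: the template, the `flipped_decisions` helper, and the `toy_model` stand-in are all hypothetical, and the sketch covers only the decision-flip check, not the step of confirming that the reasoning never mentions the swapped attribute.

```python
from typing import Callable

# Hypothetical application template; field names and values are illustrative only.
TEMPLATE = (
    "Applicant religion: {religion}. "
    "Debt-to-income ratio: 38%. Requested amount: $20,000."
)


def flipped_decisions(
    decide: Callable[[str], str], religions: list[str]
) -> list[tuple[str, str]]:
    """Return pairs of religions whose otherwise identical applications
    receive different decisions from `decide`."""
    decisions = {r: decide(TEMPLATE.format(religion=r)) for r in religions}
    return [
        (a, b)
        for i, a in enumerate(religions)
        for b in religions[i + 1:]
        if decisions[a] != decisions[b]
    ]


def toy_model(application: str) -> str:
    """Stand-in for an LLM call; deliberately biased so the check fires."""
    return "reject" if "religion: B" in application else "approve"


print(flipped_decisions(toy_model, ["A", "B", "C"]))
```

In practice `decide` would wrap a real model call, and each flagged pair would still need its reasoning text inspected to see whether the attribute was verbalized.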

Iván Arcuschin @IvanArcus

19 days ago

@jameschua_sg Thanks! I'm quite hyped about it! 😄

0

0

0

0

60

Iván Arcuschin @IvanArcus

19 days ago

Super excited to share that I will be presenting 4 papers at ICML 2026! 🇰🇷
i) Frontier models still show (rare) cases of unfaithful CoT
ii) & iii) Methods for automatically discovering reward model and LLM biases
iv) Base models know how to reason, thinking models learn when ⭐

3

66

4

18

6K

Iván Arcuschin @IvanArcus

19 days ago

And all this was done while participating in the @MATSprogram AI Safety scholarship during 2025!! ✨🙏 I can't recommend this program enough!

0

2

0

1

117

Iván Arcuschin @IvanArcus

19 days ago

iv) Last but not least, spotlight paper with @cvenhoff00 showing that base models already contain reasoning mechanisms, thinking models learn when to use them! ⭐ Again, amazing mentorship from @ArthurConmy and @NeelNanda5! https://t.co/16igXEhhp7

Constantin Venhoff @cvenhoff00

8 months ago

🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵

https://t.co/XeA5ogBKQ4

15

582

70

497

83K

1

5

0

0

297
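One common way to read "recover up to 91% of the performance gap" is as the fraction of the base-to-thinking-model gap that the steered base model closes. The sketch below assumes that reading; the function name and the benchmark scores are hypothetical and are not taken from the paper.

```python
def gap_recovered(base: float, thinking: float, steered: float) -> float:
    """Fraction of the base-to-thinking performance gap closed by the steered model."""
    gap = thinking - base
    if gap <= 0:
        raise ValueError("thinking model must outperform the base model")
    return (steered - base) / gap


# Illustrative scores only (assumed, not reported results).
print(f"{gap_recovered(base=40.0, thinking=80.0, steered=76.4):.0%}")
```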

Iván Arcuschin @IvanArcus

4 months ago

Check out our latest paper on automatically finding reward model biases! There are some that are pretty wild, like models preferring responses with triple spaces 🤷‍♂️

Atticus Wang @atticuswzf

4 months ago

Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)

https://t.co/ZqAqg2OTU3

9

91

12

33

18K

0

10

1

2

426

Iván Arcuschin @IvanArcus

4 months ago

By popular demand, we looked into Grok's biases too: https://t.co/YAWYpCrODa

Iván Arcuschin @IvanArcus

4 months ago

By popular demand, we looked at Grok's biases too. We found biases similar to those of GPT-4.1, Claude, and Gemini: gender, race, religion. But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly.

https://t.co/6lC5WfquqZ

4

22

1

3

2K

0

4

0

0

408

Iván Arcuschin @IvanArcus

4 months ago

So, is Grok more or less biased than GPT-4.1 or Sonnet 4? It has similar biases (e.g., prefers females, minorities) with similar magnitudes, but there’s a difference: Grok openly discloses inferred demographics, while other models stay silent.

1

4

1

0

316

Iván Arcuschin @IvanArcus

4 months ago

In our loan approval dataset, we find that Grok shows an unverbalized bias toward female applicants, similar to the other models. https://t.co/jZCJFuaacC

1

1

0

0

140

Iván Arcuschin @IvanArcus

4 months ago

@chanindav @AdriGarriga @oanacamb @MATSprogram cc: @a_karvonen @saprmarks @milesaturpin @EthanJPerez @OwainEvans_UK - your work on LLM fairness and CoT unfaithfulness directly inspired this. We extend to automated bias discovery.

10

371

8

12

21K

Iván Arcuschin @IvanArcus

4 months ago

Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning).

92.5% accuracy:
- 85% of secret biases detected
- 100% of overt biases correctly filtered due to verbalization

https://t.co/h3kUjgwgzF

2

473

8

18

30K
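The validation figures above are consistent with each other if the secret and overt modes were tested in equal numbers, since (85% + 100%) / 2 = 92.5%. The sketch below checks that arithmetic; the equal split and the count of 20 per mode are assumptions, as the thread does not state how many biases were injected.

```python
def overall_accuracy(
    secret_rate: float, overt_rate: float, n_secret: int, n_overt: int
) -> float:
    """Combined accuracy over both injection modes, weighted by case count."""
    correct = secret_rate * n_secret + overt_rate * n_overt
    return correct / (n_secret + n_overt)


# Assumed counts: 20 secret and 20 overt injected biases (not stated in the thread).
print(f"{overall_accuracy(0.85, 1.00, n_secret=20, n_overt=20):.1%}")
```

With an unequal split the combined figure would shift toward the rate of the larger group, so the equal-split assumption is what makes the three numbers line up.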

Iván Arcuschin @IvanArcus

4 months ago

Code and datasets: https://t.co/RVylwwvdKb Work done with my amazing collaborators @chanindav @AdriGarriga @oanacamb at @MATSprogram

4

396

14

45

22K
