wassname

@wassname

Let's align AI better than humans. h+, curiosity, and the good ending. anon feedback:

Perth, Australia

Joined September 2009

1.4K Following

192 Followers

1.6K Posts

Pinned Tweet

wassname @wassname

5 months ago

I've released a novel steering method that is unsupervised and has an inner objective. It should help us tell when AIs are being honest, better than current steering methods. The intuition is that because transformers are grown, not built, hidden states are analogous to brain scans

4

12

2

6

1K

wassname @wassname

about 9 hours ago

@juddrosenblatt have you considered more subtle and coherent erasures of SOO? There are some datasets that explicitly vary moral PoV too, between 1st person and 3rd person

0

0

0

0

19

wassname retweeted

Geoffrey Irving

@geoffreyirving

2 days ago

AI-assisted formal proofs (in particular in Lean) are getting very good! A worry I have is that people will insufficiently update about how powerful this stuff can be, and thus fail to tackle sufficiently big projects. https://t.co/j4cKBpAl5K

2

37

6

8

2K

wassname @wassname

3 days ago

@Sauers_ cool, what were your contrastive prompts? Or SAE?

1

1

0

0

17

wassname @wassname

4 days ago

https://t.co/IVw1kJptbq

0

0

0

0

10

wassname @wassname

4 days ago

Alignment has been achieved externally

wassname's tweet photo. Alignment has been achieved externally https://t.co/1SUVmamJBs

1

0

0

0

10

wassname @wassname

4 days ago

@xlr8harder Sweet! FYI, that looks empty to me. Yeah, I think it's worth uploading. When I want "scissor statements", speech map is a pretty good place to find questions that empirically split LLM opinions. And this is a useful thing for tracking opinion change as well as free speech

0

1

0

0

14

wassname @wassname

5 days ago

@xlr8harder is this ok? https://t.co/AyxRHc5rUm or shall I make it private

1

1

0

0

12

wassname @wassname

6 days ago

also UV filters (cheap), better 1-week quarantine hotels at airports (expensive but worth it), and open source zkp contact tracing. These things all help a lot without sacrificing our civil liberties

7 days ago

Sam Altman, Dario Amodei, Demis Hassabis and many others have signed a letter urging Congress to increase security on orders of synthetic nucleic acids - and the equipment needed to make them - as models continue to become increasingly bio-capable.

AndrewCurran_'s tweet photo. https://t.co/JLw1Iq51Fx

94

2K

424

940

504K

0

2

0

0

20

wassname @wassname

7 days ago

@camila_blank https://t.co/jAos7XZ8XM

wassname's tweet photo. https://t.co/7ad7gelR8m

0

0

0

0

3

wassname @wassname

7 days ago

@camila_blank here is a cheap and fast way I eval steering vectors https://t.co/gUo0wNqG8V

1

0

0

0

10

wassname @wassname

9 days ago

the Google AI seems well aligned

10 days ago

Sauers_'s tweet photo. https://t.co/1sgGyyUEKQ

21

3K

87

202

105K

0

2

0

0

157

wassname retweeted

@spicey_lemonade

10 days ago

The model is dropped into a fake simulated universe where the laws of physics are not normal Newtonian physics. Then the model has to behave like a scientist and discover laws, propose experiments and test etc. There was a big jump from 5.4 to 5.5

spicey_lemonade's tweet photo. https://t.co/GegGm6N3fv

0

40

6

12

3K

wassname retweeted

@biancoresearch

10 days ago

This chart is more important. Token usage (blue bars) is exploding higher. It started in January when Agentic AI went mainstream with Claude Cowork and Moltbook (OpenClaw). AI users are creating agents and code, leading to exponential growth in AI usage. It's just starting.

biancoresearch's tweet photo. https://t.co/n0W8MvRUJG

74

581

84

224

121K

wassname @wassname

10 days ago

> Our proposed method, SGTM, further improves the trade-off between retaining general capabilities and removing target knowledge, achieving better retain/forget trade-offs while maintaining robustness to labeling errors.

0

0

0

0

25

wassname @wassname

10 days ago

Seriously good paper. It's robust to noise, and it's highly pragmatic since it uses backprop in a non-adversarial manner

Igor Shilov @_igorshilov

6 months ago

New Anthropic research! We study how to train models so that high-risk capabilities live in a small, separate set of parameters, allowing clean capability removal when needed – for example in CBRN or cybersecurity domains.

_igorshilov's tweet photo. https://t.co/jX7ThUf0SF

33

1K

111

630

144K

1

0

0

0

18

wassname @wassname

10 days ago

> the absorption property. Even when some harmful examples are mislabeled as benign, gradient routing mechanisms can partially localize their impact to the designated parameters, maintaining effective removal despite labeling errors.

1

0

0

0

18

wassname @wassname

10 days ago

. @GrantCobleNeal "The Economics of Human Extinction": this is an interesting way of framing it in terms of economics, and pretty bold for an Assistant Minister https://t.co/HcN2fwZres

0

1

0

1

29

wassname @wassname

12 days ago

Wu Tang meets reward hacking

Machine Learning (ML) Papers @Memoirs

2 months ago

When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals. Rui Wu, Ruixiang Tang. https://t.co/2n4BAPCphW [cs.LG cs.CL]

Memoirs's tweet photo. https://t.co/GrqgqxFTZ8

1

2

0

2

103

0

0

0

0

18
