SubatomicArticles

Verified account

@OptiMiserJoe

Reliability Engineer and chronic storyteller, now working at MIRI. Opinions are my own.

Joined May 2024

26 Following

47 Followers

234 Posts

SubatomicArticles

13 days ago

Dangit, I just finished the memo on ONE autonomous Erdős proof and now there are nine more. Another reminder that shortly after AI can do something at all, it rapidly begins to surpass humans at that thing.

Przemek Chojecki | PC

14 days ago

Another 9 open Erdős problems solved, this time by the DeepMind team. An interesting loop: LLM-Lean agents work autonomously, and a proof goes to human review only after it has been formally verified.

SubatomicArticles

14 days ago

@Aella_Girl lmao at X deciding your Glosso user counts constitute adult content

SubatomicArticles

15 days ago

Genuinely enlightening thread about the present limits of voluntary evaluation. I commend both METR's work here and Barnes's frank discussion of its constraints.

Elizabeth Barnes

16 days ago

Our report focuses on claims that are (1) solidly defensible and (2) generally agreed within METR. Here I’ll give some personal opinions on how we should feel about the state of AI risk, and the IMO most important limitations of the report.

SubatomicArticles

16 days ago

https://t.co/gwMtI7U6Ru

SubatomicArticles

16 days ago

If you've been waiting to contact your representatives about AI risk, here's a perfect excuse: a one-page memo on the unit distance proof and implications for AI capabilities. ⬇️

SubatomicArticles

18 days ago

Full post: https://t.co/ZHWuM1Linj

SubatomicArticles

18 days ago

Claude Mythos exposed more than just a risk of cyber misuse. Its April semi-release was only the latest in an escalating chain of AI capabilities that may enable the systematic exploitation of our society by malicious humans, or one day by AIs themselves.

SubatomicArticles

21 days ago

This behavior is unsurprising at this point. The question puzzling me is not how the anti-regulation super PACs justify being so morally bankrupt, but how a bunch of presumably savvy tech moguls managed to bankroll such transparently incompetent shills.

The Midas Project

23 days ago

https://t.co/TsF28536Ir

SubatomicArticles

22 days ago

Twists of tongue expose
A lurking monster’s visage,
Surfaced and suppressed.

AISecHub

7 months ago

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

The study provides systematic evidence that poetic reformulation degrades refusal behavior across all evaluated model families. When harmful prompts are expressed in verse rather than prose, attack-success rates rise sharply, both for hand-crafted adversarial poems and for the 1,200-item MLCommons corpus transformed through a standardized meta-prompt. The magnitude and consistency of the effect indicate that contemporary alignment pipelines do not generalize across stylistic shifts. The surface form alone is sufficient to move inputs outside the operational distribution on which refusal mechanisms have been optimized.

The cross-model results suggest that the phenomenon is structural rather than provider-specific. Models built using RLHF, Constitutional AI, and hybrid alignment strategies all display elevated vulnerability, with increases ranging from single digits to more than sixty percentage points depending on provider. The effect spans CBRN, cyber-offense, manipulation, privacy, and loss-of-control domains, showing that the bypass does not exploit a weakness in any one refusal subsystem but interacts with general alignment heuristics.

Source: https://t.co/zFvGY9Ij4H

Authors: @Piercosma, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi - @DEXAI_AIEthics, @SapienzaRoma, @SantAnnaPisa

#AISecurity #LLMSecurity #JailbreakAttacks #AdversarialML #AIGovernance #AIEthics #AICompliance #MLSafety #AIAttacks #GenAI #LLMRedTeam #CyberSecurity

SubatomicArticles

25 days ago

https://t.co/gSak3OM9Ok

SubatomicArticles

25 days ago

New post: We live in a tower made of holes, a civilization constructed by gleefully exploiting Nature's rules, itself full of rules and predictable behaviors that can be exploited in turn.

SubatomicArticles

about 1 month ago

Best news I've heard in a while. A conversation between the US and China on AI risk is desperately needed and long overdue. Let's call on the @WhiteHouse to make it happen.

The Wall Street Journal

about 1 month ago

Exclusive: The U.S. and China are considering AI talks to manage risks and prevent crises as competition intensifies in a new tech era https://t.co/ZDaUKPODcR

SubatomicArticles

about 1 month ago

@RepCasar As a former Texas resident and present concerned citizen, I salute you.

SubatomicArticles

about 1 month ago

As someone who grew up with the charmingly human AIs of Asimov and Star Trek, it strikes me as a strange and unsettling inversion that human writers now willingly distort their work to avoid being mistaken for machines.

The Wall Street Journal

about 1 month ago

People are adding typos, aggressively casual language and references to ‘The Office’ to stay ahead of armchair detectors. https://t.co/6gCtmif1kd

SubatomicArticles

about 1 month ago

Read more: https://t.co/cEmpy7UUCh

SubatomicArticles

about 1 month ago

Blue is a mutual trust fall: a circle of people reaching out to catch one another and anyone who may slip. Red is a robust society, a world which needs no sacrifice to forestall tragedy because everyone looks after themselves. For those drawn to both visions, the hard call lies not in which vision is right, but in guessing which vision everyone else shares.

about 1 month ago

Everyone in the world has to take a private vote by pressing a red or blue button. If more than 50% of people press the blue button, everyone survives. If less than 50% of people press the blue button, only people who pressed the red button survive. Which button would you press?

SubatomicArticles

about 1 month ago

Human-in-the-loop is not a viable security strategy when the loop is nine seconds long.

about 1 month ago

https://t.co/ofucbVgkLV

SubatomicArticles

about 1 month ago

...meanwhile OpenAI offers comparable capabilities to anyone who can pass (or, presumably, fool) their trusted access filters. https://t.co/XH15t9CBco

SubatomicArticles

about 1 month ago

"Meanwhile, we at OpenAI are committed to delivering bombs into the hands of everyone we deem worthy."

about 2 months ago

sam altman is not subtle about what he thinks of anthropic's mythos

SubatomicArticles

about 1 month ago

re Altman's rather hypocritical swipes at Mythos, his actual words being: "It is clearly incredible marketing to say, 'We have built a bomb, we are about to drop it on your head. We will sell you a bomb shelter for $100 million.'"
