ML Safety Daily @topofmlsafety - Twitter Profile

almost 2 years ago

Backtracking Improves Generation Safety

Introduces a method to improve adversarial robustness by teaching the model to use a special reset token on unsafe generations and regenerate a response from scratch. https://t.co/QplnGwBBOf

almost 2 years ago

Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness

Finds that adversarial attacks do not fool intermediate layer activations. Improves robustness by ensembling all layer predictions. Gradient attacks produce human-interpretable image changes. https://t.co/LmkTUJ1lKC

almost 2 years ago

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

A new cyber eval spanning a wide range of difficulties. Many CTF tasks are associated with sub-tasks to provide better signal on partial progress. https://t.co/tWhW6rH75U

almost 2 years ago

CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Finds that modern LMs have moderate ability to automate spear-phishing, and provide minimal uplift over non-LLM baselines for offensive cyber operations. https://t.co/aCvH2yGgja

about 2 years ago

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Argues that the multiple-choice format of many benchmarks confounds the otherwise-smooth relationship between scale and downstream performance. https://t.co/RRuPNmzvt3

about 2 years ago

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Introduces several improvements to the GCG automatic jailbreaking method, improving efficiency tenfold. https://t.co/nDzu8el86U

about 2 years ago

Efficient Adversarial Training in LLMs with Continuous Attacks

Proposes a method for LLM adversarial training that does not require expensive discrete optimization steps. https://t.co/MRNWQP7Rnz

about 2 years ago

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

"If we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable" https://t.co/xHh99A62pp

about 2 years ago

Benchmark Early and Red Team Often

To test a model's potential for misuse, developers can run low-cost benchmarks or expensive red teaming evaluations. How should developers navigate this tradeoff? https://t.co/48Do0s22WW

about 2 years ago

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

The paper introduces a family of approaches to AI safety, called Guaranteed Safe AI, which aim to produce AI systems equipped with high-assurance quantitative safety guarantees. https://t.co/Sxc9CV84AR

about 2 years ago

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

New benchmark to test alignment faking behaviors in Large Language Models using different detection strategies. https://t.co/DjlX2zadzh

about 2 years ago

"Generate human-readable adversarial prompts in seconds, ∼800× faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the Target LLM." https://t.co/KtfoDJO9oh

about 2 years ago

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Improves LLM robustness by teaching models to prioritize and selectively ignore instructions based on their source. https://t.co/Aqk6WWY2Sj

about 2 years ago

LLM Agents can Autonomously Exploit One-day Vulnerabilities

GPT-4 can autonomously exploit 87% of real-world one-day vulnerabilities, identified in a dataset of critical-severity CVEs, compared to 0% for all other tested models. https://t.co/z821Po3fnL

about 2 years ago

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

"Identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs)... we pose 200+ concrete research questions." https://t.co/Yx8pNOyKkF

about 2 years ago

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Method for LLM unlearning that outperforms existing gradient ascent methods on a synthetic benchmark, avoiding catastrophic collapse. https://t.co/hJjmFw9TUt

about 2 years ago

JailbreakBench is an LLM jailbreak benchmark with a dataset of jailbreaking behaviors, a collection of adversarial prompts, and a leaderboard for tracking the performance of attacks and defenses on language models. https://t.co/6yTb63DeqG

over 2 years ago

"We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors." https://t.co/bwc8YMrC2g

over 2 years ago

Vulnerability Detection with Code Language Models: How Far Are We?

Exposes flaws in existing datasets for vulnerability-detection LLMs and introduces a more accurate dataset, on which current models, including GPT-3.5 and GPT-4, perform poorly. https://t.co/umfbnbqIrC

over 2 years ago

Jailbreaking is Best Solved by Definition

Existing defenses against LLM jailbreaks fail; a successful defense must accurately define what constitutes unsafe outputs, with post-processing emerging as a robust solution given a good definition. https://t.co/6PjbjPYnjC
