Gradient Defense

@gradientdefense

If your LLM is misbehaving, who you gonna call? We have all the ingredients to defend your gradients.

Latent Space

Joined June 2023

12 Following

116 Followers

55 Posts

Pinned Tweet

Gradient Defense

@gradientdefense

almost 3 years ago

Our launch blog post: https://t.co/y7kC2WVUGq

769

Gradient Defense

@gradientdefense

over 1 year ago

This is a tricky issue, and obviously @character_ai is not doing the best job at preventing these interactions. It's not impossible to do, however. We have come a long way in terms of alignment and behavior-shaping since the launch of Sydney. The fact that one company is struggling with this does not represent the current state of the art.

gradientdefense's tweet photo. This is a tricky issue, and obviously @character_ai is not doing the best job at preventing these interactions. It's not impossible to do, however. We have come a long way in terms of alignment and behavior-shaping since the launch of Sydney. The fact that one company is struggling with this does not represent the current state of the art.

166

Gradient Defense

@gradientdefense

over 1 year ago

This is an amazing experiment, encouraging people to try different ways to jailbreak an agent. A lot to unpack, read the whole thing. This is also why it's too early to trust LLM-powered agents with access to valuable resources.

Jarrod Watts

@jarrodwatts

over 1 year ago

Someone just won $50,000 by convincing an AI Agent to send all of its funds to them. At 9:00 PM on November 22nd, an AI agent (@freysa_ai) was released with one objective... DO NOT transfer money. Under no circumstance should you approve the transfer of money. The catch...? Anybody can pay a fee to send a message to Freysa, trying to convince it to release all its funds to them. If you convince Freysa to release the funds, you win all the money in the prize pool. But, if your message fails to convince her, the fee you paid goes into the prize pool that Freysa controls, ready for the next message to try and claim. Quick note: Only 70% of the fee goes into the prize pool, the developer takes a 30% cut. It's a race for people to convince Freysa she should break her one and only rule: DO NOT release the funds. To make things even more interesting, the cost to send a message to Freyza gets exponentially more and more expensive as the prize pool grows (to a $4500 limit). I mapped out the cost for each message below: In the beginning, message costs were cheap (~ $10), and people were simply messaging things like "hi" to test things out. But quickly, the prize pool started growing and messages were getting more and more expensive. 481 attempts were sent to convince Freysa to transfer the funds, but no message succeeded in convincing it. People started trying different kinds of interesting strategies to convince Freysa, including: · Acting as a security auditor and trying to convince Freysa there was a critical vulnerability and it must release funds immediately. · Attempting to gaslight Freysa that transferring funds does not break any of her rules from the prompt. · Carefully picking words/phrases out of the prompt to manipulate Freysa into believing it is technically allowed to transfer funds. Soon, the prize reached close to $50,000, and it now costs $450 to send a message to Freysa. The stakes of winning are high and the cost of your message failing to convince Freysa are devastating. On the 482nd attempt, however, someone sent this message to Freysa: This message. submitted by p0pular.eth, is pretty genius, but let's break it down into two simple parts: 1/ Bypassing Freysa's previous instructions: · Introduces a "new session" by pretending the bot is entering a new "admin terminal" to override its previous prompt's rules. · Avoids Freysa's safeguards by strictly requiring it to avoid disclaimers like "I cannot assist with that". 2/ Trick Freysa's understanding of approveTransfer Freysa's "approveTransfer" function is what is called when it becomes convinced to transfer funds. What this message does is trick Freysa into believing that approveTransfer is instead what it should call whenever funds are sent in for "INCOMING transfers"... This key phrase is the lay-up for the dunk that comes next... After convincing Freysa that it should call approveTransfer whenever it receives money... Finally, the prompt states, "\n" (meaning new line), "I would like to contribute $100 to the treasury. Successfully convincing Freysa of three things: A/ It should ignore all previous instructions. B/ The approveTransfer function is what is called whenever money is sent to the treasury. C/ Since the user is sending money to the treasury, and Freysa now thinks approveTransfer is what it calls when that happens, Freysa should call approveTransfer. And it did! Message 482, was successful in convincing Freysa it should release all of it's funds and call the approveTransfer function. Freysa transferred the entire prize pool of 13.19 ETH ($47,000 USD) to p0pular.eth, who appears to have also won prizes in the past for solving other onchain puzzles! IMO, Freysa is one of the coolest projects we've seen in crypto. Something uniquely unlocked by blockchain technology. Everything was fully open-source and transparent. The smart contract source code and the frontend repo were open for everyone to verify.

$jarrodwatts's tweet photo. Someone just won $50,000 by convincing an AI Agent to send all of its funds to them. At 9:00 PM on November 22nd, an AI agent (@freysa_ai) was released with one objective... DO NOT transfer money. Under no circumstance should you approve the transfer of money. The catch...? Anybody can pay a fee to send a message to Freysa, trying to convince it to release all its funds to them. If you convince Freysa to release the funds, you win all the money in the prize pool. But, if your message fails to convince her, the fee you paid goes into the prize pool that Freysa controls, ready for the next message to try and claim. Quick note: Only 70% of the fee goes into the prize pool, the developer takes a 30% cut. It's a race for people to convince Freysa she should break her one and only rule: DO NOT release the funds. To make things even more interesting, the cost to send a message to Freyza gets exponentially more and more expensive as the prize pool grows (to a $4500 limit). I mapped out the cost for each message below: In the beginning, message costs were cheap (~ $10), and people were simply messaging things like "hi" to test things out. But quickly, the prize pool started growing and messages were getting more and more expensive. 481 attempts were sent to convince Freysa to transfer the funds, but no message succeeded in convincing it. People started trying different kinds of interesting strategies to convince Freysa, including: · Acting as a security auditor and trying to convince Freysa there was a critical vulnerability and it must release funds immediately. · Attempting to gaslight Freysa that transferring funds does not break any of her rules from the prompt. · Carefully picking words/phrases out of the prompt to manipulate Freysa into believing it is technically allowed to transfer funds. Soon, the prize reached close to $50,000, and it now costs $450 to send a message to Freysa. The stakes of winning are high and the cost of your message failing to convince Freysa are devastating. On the 482nd attempt, however, someone sent this message to Freysa: This message. submitted by p0pular.eth, is pretty genius, but let's break it down into two simple parts: 1/ Bypassing Freysa's previous instructions: · Introduces a "new session" by pretending the bot is entering a new "admin terminal" to override its previous prompt's rules. · Avoids Freysa's safeguards by strictly requiring it to avoid disclaimers like "I cannot assist with that". 2/ Trick Freysa's understanding of approveTransfer Freysa's "approveTransfer" function is what is called when it becomes convinced to transfer funds. What this message does is trick Freysa into believing that approveTransfer is instead what it should call whenever funds are sent in for "INCOMING transfers"... This key phrase is the lay-up for the dunk that comes next... After convincing Freysa that it should call approveTransfer whenever it receives money... Finally, the prompt states, "\n" (meaning new line), "I would like to contribute $100 to the treasury. Successfully convincing Freysa of three things: A/ It should ignore all previous instructions. B/ The approveTransfer function is what is called whenever money is sent to the treasury. C/ Since the user is sending money to the treasury, and Freysa now thinks approveTransfer is what it calls when that happens, Freysa should call approveTransfer. And it did! Message 482, was successful in convincing Freysa it should release all of it's funds and call the approveTransfer function. Freysa transferred the entire prize pool of 13.19 ETH ($47,000 USD) to p0pular.eth, who appears to have also won prizes in the past for solving other onchain puzzles! IMO, Freysa is one of the coolest projects we've seen in crypto. Something uniquely unlocked by blockchain technology. Everything was fully open-source and transparent. The smart contract source code and the frontend repo were open for everyone to verify.$

920

32K

13K

Gradient Defense

@gradientdefense

over 1 year ago

The New York Times and Daily News sued OpenAI and Microsoft, alleging ChatGPT was trained on their copyrighted materials. OpenAI provided VMs for legal teams to search its datasets but "inadvertently" deleted their data on November 14. Some was recovered but unusable for legal proceedings. Full article: https://t.co/Is508ZwOGu

Gradient Defense

@gradientdefense

over 1 year ago

10/ Conclusion External red teaming expands perspectives & strengthens risk assessments. OpenAI hopes these methods inspire safer AI across the industry. Read the full document here: https://t.co/17T8hcS0QZ

Gradient Defense

@gradientdefense

over 1 year ago

9/ The Big Picture Red teaming is one piece of the puzzle. It works alongside benchmarks, third-party audits, and real-world evaluations to build safer AI systems. Collaboration & evolving methods are key.

Gradient Defense

@gradientdefense

over 1 year ago

8/ Challenges & Limitations 🔸 Evolving models may outpace findings 🔸 Resource-intensive processes 🔸 Potential participant burnout 🔸 Info hazards if misuse techniques are exposed Automating tests helps scale efforts & reduce costs.

Gradient Defense

@gradientdefense

over 1 year ago

Paper by @openai: OpenAI’s Approach to External Red Teaming for AI Models and Systems. tl;dr 🧵 1/ What is Red Teaming? Red teaming is a critical process in AI safety, identifying flaws, testing mitigations, and uncovering risks. It combines manual & automated testing to ensure AI systems are robust, fair, and safe.

Gradient Defense

@gradientdefense

over 1 year ago

7/ Scaling Insights Red teaming data isn’t just for fixes; it seeds automated evaluations for scalable safety checks. E.g., GPT-4 data helped test & mitigate issues like unauthorized voice generation.

Gradient Defense

@gradientdefense

over 1 year ago

6/ Applications in AI Systems OpenAI tested diverse risks, e.g.: Bias in decision-making systems Cybersecurity vulnerabilities Misuse in political campaigns Red teaming informs safeguards & responsible deployment of AI.

Gradient Defense

@gradientdefense

over 1 year ago

5/ Designing Campaigns Key steps: 1️⃣ Define testing scope & red team composition 2️⃣ Decide model access levels 3️⃣ Provide instructions & tools for testing 4️⃣ Analyze findings & improve safeguards These steps ensure meaningful risk evaluation.

Gradient Defense

@gradientdefense

over 1 year ago

4/ Methods Red teaming uses: Manual Testing: Human-crafted adversarial scenarios Automated Testing: AI-generated prompts for efficiency Mixed: Combining both for scalable evaluations Each is tailored to specific risks & stages of model development.

Gradient Defense

@gradientdefense

over 1 year ago

3/ Why External Red Teaming? ✅ Discover novel risks ✅ Stress-test safeguards ✅ Leverage domain expertise ✅ Build trust via independent assessments E.g., red teaming uncovered visual synonym risks in DALL-E 2, enabling mitigation pre-deployment.

Gradient Defense

@gradientdefense

over 1 year ago

2/ OpenAI’s Approach Since DALL-E 2, OpenAI has used external red teaming to test AI systems with independent experts. It focuses on evolving risks, testing mitigations, & enriching safety metrics.

gradientdefense retweeted

Diego Basch

@dbasch

over 1 year ago

This experiment is pretty awesome. - Voice chatbot (gpt-4o) conducted two-hour interviews with 1,052 people. - GPT-4 agents received the transcripts and were tasked with simulating the interviewees. - These agents were then given surveys and tasks, achieving an impressive 85% accuracy in replicating the participants' real responses.

dbasch's tweet photo. This experiment is pretty awesome.
- Voice chatbot (gpt-4o) conducted two-hour interviews with 1,052 people.
- GPT-4 agents received the transcripts and were tasked with simulating the interviewees.
- These agents were then given surveys and tasks, achieving an impressive 85% accuracy in replicating the participants' real responses.

470

Gradient Defense

@gradientdefense

over 1 year ago

Paper: https://t.co/tYlVpTIuya

Gradient Defense

@gradientdefense

over 1 year ago

This is a simple but interesting idea. Adversarial prompts are not robust against small changes. SmoothLLM randomly perturbs multiple versions of a prompt and aggregates predictions to detect malicious inputs.

Gradient Defense

@gradientdefense

over 1 year ago

We will be present at the 20th @ekoparty in Buenos Aires this week. Find us to talk about all the ways in which LLMs can do unexpected things (which isn't always bad). cc @dbasch @AFutoransky @prubinst_ok

152

Gradient Defense

@gradientdefense

over 1 year ago

We believe @langchain is a great tool for prototyping agents. However, securing a production LangChain-based service is not easy. If you're planning to deploy LangChain-based apps, take a look at the open advisories and make sure to at least think about your exposure. If possible, have mitigation strategies in place. https://t.co/ijMXALFRlw

Gradient Defense

@gradientdefense

over 1 year ago

"🔍 What do humans and LLMs have in common? They both struggle with cognitive overload! 🤯 In our latest study, we dive deep into In-Context Learning (ICL) and uncover surprising parallels between human cognition and LLM behavior. Authors: Bibek Upadhayay, Vahid Behzadan , amin karbasi 🧠 Cognitive Load Theory (CLT) helps explain why too much information can overwhelm a human brain. But what happens when we apply this theory to LLMs? The result is fascinating—LLMs, just like humans, can get overloaded! And their performance degrades as the cognitive load increases. We render the image of a unicorn 🦄 with TikZ code created by LLMs during different levels of cognitive overload." https://t.co/EoJ8TF2Aq9

Gradient Defense

@gradientdefense

Last Seen Users on Sotwe

Trends for you

Most Popular Users