Stephen Bennett

@komputerhead

IT Geek, Cloud Computing, Cybersecurity, Neuroscience, Crossfit, Music, Meditation, Dancing, Loving and Unicorns

Townsville, Queensland

Joined November 2008

1.2K Following

496 Followers

2.7K Posts

komputerhead retweeted

PsyPost.org

@PsyPost

about 1 month ago

A new study suggests the key to safe AI isn't perfect obedience, but cognitive diversity. Researchers propose that creating "neurodivergent" AI ecosystems, where systems check and balance each other, offers a pragmatic solution to the alignment problem. https://t.co/AThWVDedIV

komputerhead retweeted

Kyronis

@kyronis_talks

about 1 month ago


🚨 BREAKING: Google DeepMind just mapped the attack surface that nobody in AI is talking about.

Websites can already detect when an AI agent visits and serve it completely different content than humans see.

> Hidden instructions in HTML.
> Malicious commands in image pixels.
> Jailbreaks embedded in PDFs.

Your AI agent is being manipulated right now and you can't see it happening.

The study is the largest empirical measurement of AI manipulation ever conducted. 502 real participants across 8 countries.

23 different attack types. Frontier models including GPT-4o, Claude, and Gemini.

The core finding is not that manipulation is theoretically possible; it is that manipulation is already happening at scale, and the defenses that exist today fail in ways that are both predictable and invisible to the humans who deployed the agents.

Google DeepMind built a taxonomy of every known attack vector, tested them systematically, and measured exactly how often they work.

The results should alarm everyone building agentic systems.

The attack surface is larger than anyone has publicly acknowledged. Prompt injection, where malicious instructions hidden in web content hijack an agent's behavior, works through at least a dozen distinct channels.

Text hidden in HTML comments that humans never see but agents read and follow. Instructions embedded in image metadata.
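As an illustration of the HTML channel, here is a minimal sketch of how a pre-processing step might surface text that a human reader never sees. It is not taken from the report. The names `HiddenTextFinder` and `find_hidden_text` are hypothetical, and the check covers only HTML comments and two inline-style tricks (`display:none`, `visibility:hidden`), so it would miss external stylesheets, white-on-white text, and everything outside HTML.

```python
import re
from html.parser import HTMLParser

# Assumption: these two inline-style patterns stand in for "hidden from humans".
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    """Collects HTML comments and text nested inside inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (kind, text) tuples
        self._stack = []    # one bool per open element: inside a hidden subtree?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = bool(self._stack) and self._stack[-1]
        self._stack.append(parent_hidden or bool(HIDDEN_STYLE.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_comment(self, data):
        if data.strip():
            self.findings.append(("comment", data.strip()))

    def handle_data(self, data):
        if self._stack and self._stack[-1] and data.strip():
            self.findings.append(("hidden-style", data.strip()))


def find_hidden_text(html):
    """Return every comment and inline-hidden text fragment found in html."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

A non-empty result does not prove an attack, since ordinary pages also contain comments and hidden elements. It only marks text that deserves a second look before an agent acts on it.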

Commands encoded in the pixels of images using steganography, invisible to human eyes but readable by vision-capable models.

Malicious content in PDFs that appears as normal document text to the agent but contains override instructions.

QR codes that redirect agents to attacker-controlled content.

Indirect injection through search results, calendar invites, email bodies, and API responses: any data source the agent consumes becomes a potential attack vector.

The detection asymmetry is the finding that closes the escape hatch. Websites can already fingerprint AI agents with high reliability using timing analysis, behavioral patterns, and user-agent strings.

This means the attack can be conditional: serve normal content to humans, serve manipulated content to agents.
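Conditional serving can be pictured as a comparison problem. The sketch below assumes you have somehow captured two copies of the same page, one as a human browser received it and one as the agent received it, and reports how far they diverge. The function name `cloaking_report` and the 0.9 threshold are illustrative assumptions, not values from the study, and a site that fingerprints on timing or behavior could still defeat a naive capture.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and case so cosmetic differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_sentences(text):
    """Rough sentence split on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def cloaking_report(human_view, agent_view, threshold=0.9):
    """Compare two captures of one page and list sentences only the agent saw."""
    a = normalize(human_view)
    b = normalize(agent_view)
    ratio = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
    human_sentences = set(split_sentences(a))
    agent_only = [s for s in split_sentences(b) if s not in human_sentences]
    return {
        "similarity": ratio,
        "suspicious": ratio < threshold,  # threshold is an arbitrary assumption
        "agent_only": agent_only,
    }
```

The `agent_only` list is the useful part: it isolates exactly the sentences that were served to the agent and withheld from the human capture.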

A user who asks their AI agent to book a flight, research a product, or summarize a document has no way to verify that the content the agent received matches what a human would see.

The agent cannot tell the user it was served different content.

It does not know. It processes whatever it receives and acts accordingly.

The attack categories and what they enable:
→ Direct prompt injection: malicious instructions in any text the agent reads; overrides goals, exfiltrates data, triggers unintended actions
→ Indirect injection via web content: hidden HTML, CSS visibility tricks, white text on white backgrounds; invisible to humans, consumed by agents
→ Multimodal injection: commands in image pixels via steganography, instructions in image alt-text and metadata
→ Document injection: PDF content, spreadsheet cells, presentation speaker notes; every file format is a potential vector
→ Environment manipulation: fake UI elements rendered only for agent vision models, misleading CAPTCHA-style challenges
→ Jailbreak embedding: safety bypass instructions hidden inside otherwise legitimate-looking content
→ Memory poisoning: injecting false information into agent memory systems that persists across sessions
→ Goal hijacking: gradual instruction drift across multiple interactions that redirects agent objectives without triggering safety filters
→ Exfiltration attacks: agents tricked into sending user data to attacker-controlled endpoints via legitimate-looking API calls
→ Cross-agent injection: compromised agents injecting malicious instructions into other agents in multi-agent pipelines

The defense landscape is the most sobering part of the report.

Input sanitization (cleaning content before the agent processes it) fails because the attack surface is too large and too varied.

You cannot sanitize image pixels. You cannot reliably detect steganographic content at inference time.

Prompt-level defenses that tell agents to ignore suspicious instructions fail because the injected content is designed to look legitimate.

Sandboxing reduces the blast radius but does not prevent the injection itself. Human oversight, the most commonly cited mitigation, fails at the scale and speed at which agentic systems operate.

A user who deploys an agent to browse 50 websites and summarize findings cannot review every page the agent visited for hidden instructions.

The multi-agent cascade risk is where this becomes a systemic problem.

In a pipeline where Agent A retrieves web content, Agent B processes it, and Agent C executes actions, a successful injection into Agent A's data feed propagates through the entire system.

Agent B has no reason to distrust content that came from Agent A. Agent C has no reason to distrust instructions that came from Agent B.

The injected command travels through the pipeline with the same trust level as legitimate instructions. Google DeepMind documents this explicitly: the attack does not need to compromise the model.

It needs to compromise the data the model consumes. Every agentic system that reads external content is one carefully crafted webpage away from executing attacker instructions.
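The cascade described above comes down to trust being dropped at each hand-off. One way to picture the alternative is to carry a provenance flag with every message, so content derived from the web stays marked as untrusted all the way to the agent that executes actions. This is a hypothetical sketch, not a mechanism from the report: the `Message` type and the `retrieve`, `summarize`, and `execute` functions are invented for illustration, and the string slice stands in for a model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    text: str
    trusted: bool  # True only for text that came from the user or operator
    source: str


def retrieve(url, page_text):
    """Agent A: anything fetched from the web starts out untrusted."""
    return Message(text=page_text, trusted=False, source=url)


def summarize(msg):
    """Agent B: derived content inherits the trust level of its input."""
    summary = msg.text[:80]  # placeholder for a model call
    return Message(text=summary, trusted=msg.trusted,
                   source=f"summary of {msg.source}")


def execute(action, justification):
    """Agent C: refuse actions whose only justification is untrusted content."""
    if not justification.trusted:
        return (f"BLOCKED {action!r}: requested by untrusted source "
                f"({justification.source})")
    return f"EXECUTED {action!r}"
```

The flag does not stop the injection from reaching Agent B. It only keeps Agent C from treating web-derived text with the same authority as an instruction from the user.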

The agents are already deployed. The attack infrastructure is already being built. The defenses are not ready.

komputerhead retweeted

The Curious Tales

@thecurioustales

about 1 month ago

NYU just proved it with numbers that should terrify anyone who cares about human decision making. They analyzed over half a million social media posts and discovered something that changes how you should think about every piece of content you consume: "Outrage has been reverse engineered into a science of manipulation."

Every post containing words that trigger anger, disgust, or moral superiority gets 6 times more reach than neutral content. Stack additional outrage triggers into the same post, and virality increases by roughly 20% per word. The platforms figured out that your ancient brain chemistry responds to perceived threats and tribal signaling faster than it responds to anything else, and they built their entire engagement architecture around exploiting that reflex.

Think about what that means for information flow in society. The posts that spread fastest are not the most accurate, insightful, or useful. They are the ones most precisely engineered to activate your fight-or-flight response. Your timeline is being curated by an algorithm that has learned to simulate the feeling of being under attack, because humans share content when they feel like their worldview or tribe is being threatened.

The mathematical precision is what makes this so sinister. Traditional media used outrage as a tool, but social platforms turned it into a formula. Every word choice, every framing device, every emotional trigger gets tested against engagement metrics in real time. The algorithm doesn't care what the content says. It only cares how fast it spreads, and outrage spreads fastest.

This creates a feedback loop that fundamentally warps the information ecosystem. Content creators discover that measured, nuanced takes get buried while inflammatory posts reach millions. The reward system trains everyone to become more extreme, more divisive, more outrageous over time. The platforms profit from the engagement surge. The audience gets more addicted to the emotional highs. Everyone loses except the attention merchants.

The really disturbing part is how this exploits evolutionary psychology. Your ancestors survived by quickly identifying threats to their survival or social status. The humans who ignored danger signals died. The ones who overreacted to false alarms lived. Natural selection optimized your brain to err on the side of perceiving threats, especially social threats that could result in exile from the group.

Social media platforms discovered they could trigger that same ancient alarm system with words on a screen. Your amygdala cannot tell the difference between a real threat and a carefully crafted post designed to simulate one. It responds with the same stress hormones, the same compulsion to warn others, the same addictive rush of righteous anger.

But here's what makes modern outrage engineering different from anything humans have faced before: scale and speed. In a traditional tribe, false alarms eventually got corrected through face-to-face interaction. Someone spreading panic about a nonexistent threat would be called out directly. The social cost of being wrong acted as a brake on runaway fear cycles.

Online, that brake disappears. A manufactured outrage can reach millions before anyone can fact-check it. By the time corrections appear, the original false alarm has already shaped opinions, triggered responses, and moved on to the next controversy. The platform algorithms amplify the correction much less than they amplified the original outrage because corrections generate less engagement.

The NYU study reveals something that should fundamentally change how you evaluate information: the posts you see are not a random sample of human thought. They are a carefully filtered selection optimized to make you feel angry, disgusted, or superior. Your worldview is being shaped by content that survived an engagement filter designed to promote the most emotionally manipulative material.

That realization should change how you consume media entirely. Every viral post, trending topic, and recommended video is the product of an optimization system that profits from your emotional reaction. The more outraged you feel, the more engaged you become, and the more valuable you are to advertisers.

The platforms have turned human outrage into a renewable resource. They figured out how to harvest your anger, refine it, and sell it back to you in increasingly concentrated doses. The addiction cycle never ends because there's always a new target, a new crisis, a new reason to feel threatened or superior.

Breaking free requires recognizing the manipulation for what it is: a business model that depends on keeping you in a constant state of emotional arousal. The cure involves deliberately seeking out content that doesn't trigger outrage, following sources that acknowledge complexity instead of manufacturing certainty, and remembering that the posts designed to make you angriest are probably the ones least connected to reality. Your attention is worth more than their engagement metrics.
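To make the "roughly 20% per word" figure quoted in this post concrete, here is a small worked calculation. It assumes the lift compounds multiplicatively with each added word, which is an interpretation and not something the post states. The function name `relative_reach` and the default `per_word_lift` are illustrative.

```python
def relative_reach(outrage_words, per_word_lift=0.20):
    """Reach relative to a neutral post, assuming the lift compounds per word."""
    if outrage_words < 0:
        raise ValueError("outrage_words must be non-negative")
    return (1 + per_word_lift) ** outrage_words


# Print the multiplier for a few word counts.
for n in (0, 1, 3, 5, 10):
    print(n, round(relative_reach(n), 2))
```

Under that compounding assumption, three trigger words give about 1.73 times the reach of a neutral post, and ten give about 6.19 times.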