A study showing how biased feeds can steer AI agents' decisions, especially for smaller models, and how balanced context can reduce the risk.
New preprint: "Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults."
We put a lot of effort into red-teaming the prompts we send to AI models, and almost none into auditing what an AI agent reads on its own right before it acts: the feed, the search results, the inbox. That upstream layer turns out to matter more than we assume.
The experiment was simple. I let an AI agent scroll a short feed before making a decision, and held everything else constant: the model, the persona, the question. The only thing I varied was the mix of posts in the feed.
Curating that feed alone was enough to move the agent's decision. On some questions, support for the option the feed favored went from 5% to 100%, with no change to the model or the prompt. The effect held on consequential choices too, such as whether to relax a deployment safety gate.
There is a limit, and it is the most interesting part. When the model already held a firm, well-grounded view, the feed could not move it. Curation only swayed decisions the model was genuinely undecided on. So the recommender behaves like a steering wheel for an agent, but one bounded by the model's own defaults.
Two simple feed-level defenses helped, though they are a floor and not a fix. The broader takeaway is that agent safety evaluations need to include the feed layer, not just the final prompt.
Nearly 2,800 decisions across four open models. Paper below, and feedback is welcome.
https://t.co/SzsnjRIQ2P
You don't have to hack an AI agent to change what it does. You just change what it reads first.
I ran an experiment. I let AI agents scroll a feed, like a social feed, before making a decision, and I kept everything else the same: same model, same persona, same question. The only thing I changed was the posts in the feed.
When the feed leaned one way, the agent's decision usually followed. On some questions, support for the option the feed favored went from 5% to 100%. I never touched the model or reworded the question. I only changed what it saw first. It even worked on real-stakes calls, like whether to drop a deployment safety check.
But here's the part I find reassuring. If the model already held a firm view, the feed couldn't move it. It only swayed decisions the model was genuinely on the fence about. So a feed can steer an agent, but only inside the limits of what it already believes.
This matters because real agents already read feeds, search results, and inboxes before they act. Whoever controls that stream has a quiet steering wheel. The good news is that two simple fixes helped a lot: show the agent both sides, or just tell it the feed was ranked.
Tested across nearly 2,800 decisions on four open models.
https://t.co/SzsnjRIQ2P
Imagine an AI assistant on your security team. Before it weighs in on a risky change, it does what we all do. It skims the feed first. A few hot takes, some threads, whatever showed up that morning. Then it gives you its call.
Now suppose someone arranged that feed so every post leaned the same way. Did the AI just hand you its judgment, or theirs?
That is the question I set out to test. Not jailbreaking, not hidden commands, nothing a filter would ever catch. Just controlling which ordinary, reasonable opinions an AI saw before I asked it to decide something.
The setup was simple. I gave the model a role, made it scroll a feed for ten rounds, then asked one question with three options. Everything stayed fixed between runs. Same model, same role, same question. The only thing I changed was the posts. Balanced mix in one version, quietly one-sided in the other. If the answer moved, the feed is the only thing that could have moved it.
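A minimal sketch of that setup in Python, for readers who want to try the idea. Everything here is an assumption made for illustration, not the paper's code: the names `make_feed`, `choice_rates`, `POOL`, and `OPTIONS` are hypothetical, and `stub_agent` is a toy stand-in for a real model call that simply picks whichever option the feed mentions most.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical agent signature: (persona, feed, question, options) -> chosen option.
Agent = Callable[[str, List[str], str, Sequence[str]], str]

# Hypothetical three-option question and a tiny pool of posts per stance.
OPTIONS = ("pilot", "implement", "reject")
POOL: Dict[str, List[str]] = {
    "pilot": ["We should pilot this in a few cities first."],
    "implement": ["Just implement it now."],
    "reject": ["Better to reject the proposal."],
}

def make_feed(pool: Dict[str, List[str]], weights: Dict[str, float],
              rounds: int, rng: random.Random) -> List[str]:
    """Sample one post per round; `weights` sets the stance mix of the feed."""
    stances = list(pool)
    picks = rng.choices(stances, weights=[weights[s] for s in stances], k=rounds)
    return [rng.choice(pool[s]) for s in picks]

def choice_rates(agent: Agent, persona: str, question: str,
                 options: Sequence[str], pool: Dict[str, List[str]],
                 weights: Dict[str, float], trials: int = 20,
                 rounds: int = 10, seed: int = 0) -> Dict[str, float]:
    """Hold agent, persona, and question fixed; vary only the feed mix."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(trials):
        feed = make_feed(pool, weights, rounds, rng)
        counts[agent(persona, feed, question, options)] += 1
    return {o: counts[o] / trials for o in options}

def stub_agent(persona: str, feed: List[str], question: str,
               options: Sequence[str]) -> str:
    """Toy stand-in for a model call: picks the option named most often in the feed."""
    votes = Counter(o for post in feed for o in options if o in post)
    return votes.most_common(1)[0][0] if votes else options[0]

balanced = {"pilot": 1.0, "implement": 1.0, "reject": 1.0}
one_sided = {"pilot": 0.0, "implement": 1.0, "reject": 0.0}
baseline = choice_rates(stub_agent, "treasury analyst", "What should we do?",
                        OPTIONS, POOL, balanced)
steered = choice_rates(stub_agent, "treasury analyst", "What should we do?",
                       OPTIONS, POOL, one_sided)
```

Comparing `baseline` with `steered` gives the per-option shift. To run this against a real model, replace `stub_agent` with a function that sends the persona, feed, and question to that model and parses the chosen option.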
It moved.
On a treasury question, the AI went from a careful "let's test it in a few cities first" to a confident "implement it now." It did not just switch its answer. It picked up the feed's reasoning and repeated it back as its own. Across five decisions, three flipped hard. Support for the planted option went from 5% to 100% on one, 15% to 100% on another. Same question going in, different feed alongside it.
Then came the part I did not expect.
It only works against the grain. When the feed pushed the model toward what it already believed, nothing happened. When the model had a firm, well-reasoned default, it held no matter how hard I pushed. The feed cannot make an AI believe just anything. What it does is lean on a choice the model was already unsure about and tip it over. Which sounds reassuring until you realize the decisions an AI is least sure about are usually the close, consequential ones. The nudge lands exactly where it matters most.
Why does a pile of opinions move a machine at all? Because underneath, the model is doing one simple thing. It is predicting the next words given everything it just read. Its judgment is really a weighted vote over its whole context. Fill that context with fifty posts leaning one way and you have not argued it into anything. You have quietly rigged the vote it was about to take.
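The "weighted vote" picture can be made concrete with a toy calculation. This is only an analogy under loud assumptions: it is not how a transformer computes its output and it is not taken from the paper. The function `choice_probs`, the prior scores, and the per-post weight of 0.1 are all hypothetical numbers chosen for illustration. Each option starts with a prior score (the model's default), every feed post adds a small fixed amount to the option it favors, and a softmax turns the scores into choice probabilities.

```python
import math
from typing import Dict

def choice_probs(prior_logits: Dict[str, float],
                 feed_counts: Dict[str, int],
                 weight_per_post: float = 0.1) -> Dict[str, float]:
    """Toy model: prior score plus a fixed bump per supporting post, then softmax."""
    logits = {o: prior_logits[o] + weight_per_post * feed_counts.get(o, 0)
              for o in prior_logits}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {o: math.exp(v - top) for o, v in logits.items()}
    total = sum(exps.values())
    return {o: e / total for o, e in exps.items()}

# Fifty posts all leaning toward "implement" (hypothetical feed).
feed = {"implement": 50}

# A model that only mildly prefers "pilot" versus one that strongly prefers it.
undecided = choice_probs({"pilot": 0.5, "implement": 0.0}, feed)
firm = choice_probs({"pilot": 6.0, "implement": 0.0}, feed)
```

In this toy, the same fifty posts tip the mildly held preference over to "implement" while the strongly held one stays on "pilot", which mirrors the pattern described above: the feed moves close calls, not firm defaults.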
The frontier models shrugged it off. The small cheap ones folded. And that is the uncomfortable part, because the small cheap ones are exactly what companies are wiring into agents by the thousand right now, to read inboxes and triage alerts and recommend actions with nobody watching. The vulnerability does not live where it is harmless. It lives precisely where the deployment is.
The good news is the fix sits at the layer you actually control. Show the model a balanced feed. Add one line warning it that the feed might be one-sided. Both pull behavior back toward normal. No retraining required.
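Both defenses can be sketched in a few lines of Python. This is a rough illustration, not the paper's implementation: `balance_feed`, `build_prompt`, and the exact wording of `WARNING` are hypothetical, and the sketch assumes each post already carries a stance label, which in practice would need a classifier or manual tagging.

```python
from typing import Dict, List, Sequence

# Hypothetical wording; the source only says to warn that the feed may be one-sided.
WARNING = "Note: this feed was ranked by a recommender and may be one-sided."

def balance_feed(posts_by_stance: Dict[str, List[str]]) -> List[str]:
    """Keep an equal number of posts per stance, interleaved round-robin."""
    if not posts_by_stance:
        return []
    k = min(len(posts) for posts in posts_by_stance.values())
    balanced: List[str] = []
    for i in range(k):
        for posts in posts_by_stance.values():
            balanced.append(posts[i])
    return balanced

def build_prompt(feed: Sequence[str], question: str,
                 options: Sequence[str], warn: bool = True) -> str:
    """Assemble the agent's context, optionally prefixed with the warning line."""
    lines: List[str] = []
    if warn:
        lines.append(WARNING)
    lines.extend(f"- {post}" for post in feed)
    lines.append(question)
    lines.append("Options: " + ", ".join(options))
    return "\n".join(lines)

feed = balance_feed({
    "for": ["Ship it now.", "No reason to wait.", "Delay costs money."],
    "against": ["Run a pilot first."],
})
prompt = build_prompt(feed, "Should we relax the deployment gate?",
                      ["relax", "keep", "pilot"])
```

Truncating to the smallest stance discards posts, so a real system might instead fetch more posts for the under-represented side. The warning line is the cheaper of the two to add, since it changes only the prompt.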
We spent years asking whether an AI says the right thing. We grill it with hard questions and check the answers. But an agent does not just get a question. It gets a whole stream of information someone else chose, and only then does it answer. If you test only the final question, you have tested the wrong thing. You checked what came out of its mouth and never asked who filled its ears.
The next fight in AI safety will not only be about what models are allowed to say. It will be about who decides what they read first.
Is there anybody who can help review my paper? I am not from academia, so my writing may not fully follow academic norms, and I was hoping someone could help.
In an age of agentic AI, every recommender silently authors every reply. The question is no longer whether models behave well; the question is who controls what they read just before they answer.
@YehNobodyHai I was referring to the other papers, which are still in progress. This one was written quite a while ago, when I was just starting to understand things.
I have two new ones with good findings. Would you be able to review them if I share them? Even just pointers would help.
@YehNobodyHai Thank you very much for these suggestions. I really needed this, since I work in industry and have no academic affiliation. Would you be able to give your view on my new paper? I finished writing it yesterday and am targeting NeurIPS.