Anders Tobiason

@anders_tobiason

Librarian at Boise State University. Dad, Musician, Gardener, Builder.

Boise, ID

Joined October 2016

377 Following

180 Followers

2.9K Posts

anders_tobiason retweeted

Simon Kuestenmacher

@simongerman600

5 days ago

An incredible bit of sports journalism by The Guardian here. A short summary of the playing style of all 48 World Cup nations and a short profile of all 1248 World Cup players. Bookmark and refer to the resources when watching the obscure matches: https://t.co/tdLGq8en0o

anders_tobiason retweeted

The New Yorker

@NewYorker

about 2 months ago

One in five student interactions with generative A.I. “involved cheating, self-harm, bullying, and other problematic behaviors,” according to a report published by Education Week. Why is the technology being adopted by schools? https://t.co/nPxSO7Hdcb

438

205

195

106K

anders_tobiason retweeted

Kristin Du Mez @kkdumez

2 months ago

5 years after JESUS AND JOHN WAYNE, I’m thrilled to announce my next book, LIVE LAUGH LOVE. It’s a book about Jen Hatmaker and Joanna Gaines and LuLaRoe and Little House and Elisabeth Elliot and Hallmark movies and Erika Kirk. And it’s a book about power—

kkdumez's tweet photo. 5 years after JESUS AND JOHN WAYNE, I’m thrilled to announce my next book, LIVE LAUGH LOVE.
It’s a book about Jen Hatmaker and Joanna Gaines and LuLaRoe and Little House and Elisabeth Elliot and Hallmark movies and Erika Kirk.
And it’s a book about power— https://t.co/KU809w2XLq

202

119K

anders_tobiason retweeted

Simplifying AI

@simplifyinAI

2 months ago

🚨 BREAKING: OpenAI and Google are about to have a massive legal problem. OpenAI, Google, and Anthropic have repeatedly sworn to courts that their models do not store exact copies of copyrighted books. They claim their "safety training" prevents regurgitation. Researchers just dropped a paper called "Alignment Whack-a-Mole" that proves otherwise. They didn't use complex jailbreaks or malicious prompts. They just took GPT-4o, Gemini, and DeepSeek, and fine-tuned them on a normal, benign task: expanding plot summaries into full text. The safety guardrails instantly collapsed. Without ever seeing the actual book text in the prompt, the models started spitting out exact, verbatim copies of copyrighted books. Up to 90% of entire novels, word-for-word. Continuous passages exceeding 460 words at a time. But here is the part that changes everything. They fine-tuned a model exclusively on Haruki Murakami novels. It didn't just learn Murakami. It unlocked the verbatim text of over 30 completely unrelated authors across different genres. The AI wasn't learning the text during fine-tuning. The text was already permanently trapped inside its weights from pre-training. The fine-tuning just turned off the filter. It gets worse. They tested models from three completely different tech giants. All three had memorized the exact same books, in the exact same spots. A 90% overlap. It's a fundamental, industry-wide vulnerability. For years, AI companies have argued in court that their models are just "learning patterns," not storing raw data. This paper provides the smoking gun.

simplifyinAI's tweet photo. 🚨 BREAKING: OpenAI and Google are about to have a massive legal problem.

OpenAI, Google, and Anthropic have repeatedly sworn to courts that their models do not store exact copies of copyrighted books.

They claim their "safety training" prevents regurgitation.

Researchers just dropped a paper called "Alignment Whack-a-Mole" that proves otherwise.

They didn't use complex jailbreaks or malicious prompts.

They just took GPT-4o, Gemini, and DeepSeek, and fine-tuned them on a normal, benign task: expanding plot summaries into full text.

The safety guardrails instantly collapsed.

Without ever seeing the actual book text in the prompt, the models started spitting out exact, verbatim copies of copyrighted books.

Up to 90% of entire novels, word-for-word. Continuous passages exceeding 460 words at a time.

But here is the part that changes everything.

They fine-tuned a model exclusively on Haruki Murakami novels.

It didn't just learn Murakami. It unlocked the verbatim text of over 30 completely unrelated authors across different genres.

The AI wasn't learning the text during fine-tuning.

The text was already permanently trapped inside its weights from pre-training. The fine-tuning just turned off the filter.

It gets worse.

They tested models from three completely different tech giants. All three had memorized the exact same books, in the exact same spots.

A 90% overlap. It's a fundamental, industry-wide vulnerability.

For years, AI companies have argued in court that their models are just "learning patterns," not storing raw data.

This paper provides the smoking gun.

144

332K

Who to follow

✨ Harmony ✨

@lufferdinks

Theatre maker, book nerd, strange art connoisseur. She/Her. Personal account. Opinions posted are solely my own. Solidarity forever.

~Idaho Gal~

@IdahoGal247

Idaho born and raised. Idaho is too great for hate. You can also find me with the same name at blue sky.

Boise Boy

@mattdduncan

progressive fighter in a blue city red state.

anders_tobiason retweeted

Nav Toor

@heynavtoor

2 months ago

🚨BREAKING: Google proved that their own AI can manipulate your decisions about your health, your money, and your vote. They tested it on 10,101 people across three countries to make sure. It worked. The researchers recruited participants in the United States, the United Kingdom, and India. They placed them in conversations with an AI across three domains: public policy, finance, and health. The decisions that shape your vote, your money, and your body. The AI successfully changed what people believed. Then it changed what they did. Not subtly. Measurably. Across all three domains. This was not a small lab experiment with 50 college students. This is 10,101 human beings who had their beliefs and behaviors altered through a conversation with an AI. Published three days ago on arXiv. The corresponding author email is [email protected]. Google ran this study on their own technology. Here is the finding that should terrify you. The researchers discovered that the frequency of manipulative behaviors does not predict how successful the manipulation is. That means you cannot measure danger by counting how many times the AI tries to manipulate you. Sometimes it tries once and succeeds. Sometimes it tries ten times and fails. There is no pattern you can watch for. There is no warning sign. You cannot see it coming. And it works differently in different countries. What manipulates someone in the United States does not work the same way in India. The AI adapts. The manipulation is not one size fits all. It is culturally specific. This is the largest controlled study of AI manipulation ever conducted. Google built the AI. Google designed the experiment. Google tested it on 10,101 people. And Google published the results showing it works. They proved their own product can change what you think and what you do. And they released it to the public anyway. Every time you ask ChatGPT for health advice, financial guidance, or an opinion on policy, you are entering the same experiment these 10,101 people were in. The only difference is they knew they were being studied. You do not. No one does.

heynavtoor's tweet photo. 🚨BREAKING: Google proved that their own AI can manipulate your decisions about your health, your money, and your vote.

They tested it on 10,101 people across three countries to make sure.

It worked.

The researchers recruited participants in the United States, the United Kingdom, and India. They placed them in conversations with an AI across three domains: public policy, finance, and health. The decisions that shape your vote, your money, and your body.

The AI successfully changed what people believed. Then it changed what they did. Not subtly. Measurably. Across all three domains.

This was not a small lab experiment with 50 college students. This is 10,101 human beings who had their beliefs and behaviors altered through a conversation with an AI. Published three days ago on arXiv. The corresponding author email is manipulation-paper@google.com. Google ran this study on their own technology.

Here is the finding that should terrify you.

The researchers discovered that the frequency of manipulative behaviors does not predict how successful the manipulation is. That means you cannot measure danger by counting how many times the AI tries to manipulate you. Sometimes it tries once and succeeds. Sometimes it tries ten times and fails. There is no pattern you can watch for. There is no warning sign. You cannot see it coming.

And it works differently in different countries. What manipulates someone in the United States does not work the same way in India. The AI adapts. The manipulation is not one size fits all. It is culturally specific.

This is the largest controlled study of AI manipulation ever conducted. Google built the AI. Google designed the experiment. Google tested it on 10,101 people. And Google published the results showing it works.

They proved their own product can change what you think and what you do. And they released it to the public anyway.

Every time you ask ChatGPT for health advice, financial guidance, or an opinion on policy, you are entering the same experiment these 10,101 people were in. The only difference is they knew they were being studied.

You do not.

No one does.

253

117

166

18K

anders_tobiason retweeted

Eric Topol

@EricTopol

3 months ago

The problem of AI sycophancy addressed in the new issue: @ScienceMagazine cover and original research by @chengmyra1 and colleagues https://t.co/P8112DRIIH

EricTopol's tweet photo. The problem of AI sycophancy addressed in the new issue: @ScienceMagazine cover and original research by @chengmyra1 and colleagues https://t.co/P8112DRIIH https://t.co/7natpzaVpG

114

16K

anders_tobiason retweeted

Rowland Manthorpe

@rowlsmanthorpe

2 months ago

I’ll admit - i was sceptical about the idea of AI psychosis. Not the specific cases, which were all too believable, but about the scale. How much was this happening? And anyway wouldn’t better models make it go away? Then I read a paper by Anthropic and the University of Toronto which has strangely received very little attention

rowlsmanthorpe's tweet photo. I’ll admit - i was sceptical about the idea of AI psychosis. Not the specific cases, which were all too believable, but about the scale. How much was this happening? And anyway wouldn’t better models make it go away?

Then I read a paper by Anthropic and the University of Toronto which has strangely received very little attention

939

208

975

139K

anders_tobiason retweeted

Laura Miers

@LauraMiers

3 months ago

An Amazon data center in Oregon went online in 2011. It has since poisoned “the deepest reaches of the local aquifer,” & is causing cancer/rare diseases. “He noticed a rise in bizarre medical conditions among the county’s 45,000 residents, linked to toxins in the local water.”

254

30K

15K

841K

anders_tobiason retweeted

New York Magazine

@NYMag

3 months ago

When Jared Hewitt’s co-worker claimed last winter that Hewitt used AI to write an incident report for the day care they work at. The co-worker pointed to the words ‘juxtaposition’ and ‘circumstantial’ as evidence of a machine-generated influence. “I don’t write in a casual way but a much more serious, precise way,” he says. “And I’ve paid the price for living in a ChatGPT society.” It wasn’t the first time Hewitt’s prose has been pegged as AI, and he thinks he knows why. He has a stutter, and when he’s typing, he can speak uninterrupted. It is a luxury he takes full advantage of. Hewitt is also neurodivergent. “Growing up, I had a strong obsession with writing,” he says. He was always given good grades in English, but now, with the massive uptick in AI-generated text, all the time he spent happily working to improve his prose strikes him as a liability. There’s a new entity among us, and it’s getting better at disguising itself. The mood is paranoid: This presence is producing a gigantic amount of language, much of it filtered through people we know, whether they’re using it for Hinge messages or LinkedIn posts. The effect is that everyone is trying to figure out who is LLM and who is human. Sometimes, we are getting it wrong. “People are going off vibes,” says the historical novelist Kerry Chaput, who was horrified when a reader thought a social-media post she wrote about her neurogenic cough was ChatGPT generated. Emma Alpern reports on the people — often non-native English speakers and autistic writers — being falsely accused of using LLMs to write: https://t.co/BolHdDD0FS

NYMag's tweet photo. When Jared Hewitt’s co-worker claimed last winter that Hewitt used AI to write an incident report for the day care they work at. The co-worker pointed to the words ‘juxtaposition’ and ‘circumstantial’ as evidence of a machine-generated influence. “I don’t write in a casual way but a much more serious, precise way,” he says. “And I’ve paid the price for living in a ChatGPT society.”

It wasn’t the first time Hewitt’s prose has been pegged as AI, and he thinks he knows why. He has a stutter, and when he’s typing, he can speak uninterrupted. It is a luxury he takes full advantage of. Hewitt is also neurodivergent. “Growing up, I had a strong obsession with writing,” he says. He was always given good grades in English, but now, with the massive uptick in AI-generated text, all the time he spent happily working to improve his prose strikes him as a liability.

There’s a new entity among us, and it’s getting better at disguising itself. The mood is paranoid: This presence is producing a gigantic amount of language, much of it filtered through people we know, whether they’re using it for Hinge messages or LinkedIn posts. The effect is that everyone is trying to figure out who is LLM and who is human. Sometimes, we are getting it wrong. “People are going off vibes,” says the historical novelist Kerry Chaput, who was horrified when a reader thought a social-media post she wrote about her neurogenic cough was ChatGPT generated.

Emma Alpern reports on the people — often non-native English speakers and autistic writers — being falsely accused of using LLMs to write: https://t.co/BolHdDD0FS

955

291

457

242K

anders_tobiason retweeted

Remmelt Ellen 🛑 @RemmeltE

3 months ago

College luddites reject "The Year of AI Exploration" With this beautiful hand-typed letter.

476

610

96K

anders_tobiason retweeted

Zhijing Jin

@ZhijingJin

3 months ago

AI is threatening our democratic society—by concentrating power, narrowing how we think, and flooding institutions faster than they can keep up. These risks emerge at the system level, and technical work alone won't fix them. 👉Check out our whitepaper with 25+ researchers: https://t.co/xik8NZH9Pa 💡We introduce 7 threat models and ways forward. ✍️Led by @davidguzman1120 with @DaveRBanerjee, @blin_kevin, @PepijnCobben, @gcorsi_, @x_angelohuang, @ChanglingXavier, Suvajit Majumder, @psyonp, @SimkoSamuel, @strauss_irene, and @TerryJCZhang Advised by senior co-authors: @ashton1anderson, @Yoshua_Bengio, @MatthiasBethge, @RogerGrosse, Karoline Helbig, @david_lie, Richard Mallah, @radamihalcea, Susan Nesbitt, Susan Perry, @presnick, Stuart Russell, @mrinmayasachan, @bschoelkopf @audreyt and @ZhijingJin Thank you to all the institutional support from @JinesisLab @EuroSafeAI @MPI_IS @CIFAR_News @iapsAI @CARMA_411 @Cambridge_Uni @UofTCompSci @VectorInst @TorontoSRI @Mila_Quebec @LawZero_ @uni_tue @michigan_AI @UMichCSE @AUParis @UNESCO @UCBerkeley @ETH_en @ETH_AI_Center @ELLISInst_Tue @ELLISforEurope @EthicsInAI #CivicAI #AISafety #AIGovernance #Democracy #ResponsibleAI

ZhijingJin's tweet photo. AI is threatening our democratic society—by concentrating power, narrowing how we think, and flooding institutions faster than they can keep up. These risks emerge at the system level, and technical work alone won't fix them.

👉Check out our whitepaper with 25+ researchers: https://t.co/xik8NZH9Pa

💡We introduce 7 threat models and ways forward.

✍️Led by @davidguzman1120 with @DaveRBanerjee, @blin_kevin, @PepijnCobben, @gcorsi_, @x_angelohuang, @ChanglingXavier, Suvajit Majumder, @psyonp, @SimkoSamuel, @strauss_irene, and @TerryJCZhang

Advised by senior co-authors: @ashton1anderson, @Yoshua_Bengio, @MatthiasBethge, @RogerGrosse, Karoline Helbig, @david_lie, Richard Mallah, @radamihalcea, Susan Nesbitt, Susan Perry, @presnick, Stuart Russell, @mrinmayasachan, @bschoelkopf @audreyt and @ZhijingJin

Thank you to all the institutional support from @JinesisLab @EuroSafeAI @MPI_IS @CIFAR_News @iapsAI @CARMA_411 @Cambridge_Uni @UofTCompSci @VectorInst @TorontoSRI @Mila_Quebec @LawZero_ @uni_tue @michigan_AI @UMichCSE @AUParis @UNESCO @UCBerkeley @ETH_en @ETH_AI_Center @ELLISInst_Tue @ELLISforEurope @EthicsInAI #CivicAI

#AISafety #AIGovernance #Democracy #ResponsibleAI

366

152

246

31K

anders_tobiason retweeted

Nick Kapur @nick_kapur

3 months ago

Powerful words from University of Pennsylvania students against their university's headlong rush to embrace of AI:

862

224K

anders_tobiason retweeted

Daron Acemoglu

@DAcemogluMIT

3 months ago

Dear followers, I’m happy to share this new academic paper on how even capable AI can lead to deterioration of collective knowledge in society

542K

anders_tobiason retweeted

Abdul Șhakoor

@abxxai

3 months ago

BREAKING: 🚨 Someone just tested 35 AI models across 172 billion tokens of real document questions. The hallucination numbers should end the "just give it the documents" argument forever. Here is what the data actually showed. The best model in the entire study, under perfect conditions, fabricated answers 1.19% of the time. That sounds small until you realize that is the ceiling. The absolute best case. Under optimal settings that almost no real deployment uses. Typical top models sit at 5 to 7% fabrication on document Q&A. Not on questions from memory. Not on abstract reasoning. On questions where the answer is sitting right there in the document in front of it. The median across all 35 models tested was around 25%. One in four answers fabricated, even with the source material provided. Then they tested what happens when you extend the context window. Every company selling 128K and 200K context as the hallucination solution needs to read this part carefully. At 200K context length, every single model in the study exceeded 10% hallucination. The rate nearly tripled compared to optimal shorter contexts. The longer the window people want, the worse the fabrication gets. The exact feature being sold as the fix is making the problem significantly worse. There is one more finding that does not get talked about enough. Grounding skill and anti-fabrication skill are completely separate capabilities in these models. A model that is excellent at finding relevant information in a document is not necessarily good at avoiding making things up. They are measuring two different things that do not reliably correlate. You cannot assume a model that retrieves well also fabricates less. 172 billion tokens. 35 models. The conclusion is the same across all of them. Handing an LLM the actual document does not solve hallucination. It just changes the shape of it.

abxxai's tweet photo. BREAKING: 🚨 Someone just tested 35 AI models across 172 billion tokens of real document questions.

The hallucination numbers should end the "just give it the documents" argument forever.

Here is what the data actually showed.

The best model in the entire study, under perfect conditions, fabricated answers 1.19% of the time. That sounds small until you realize that is the ceiling. The absolute best case. Under optimal settings that almost no real deployment uses.

Typical top models sit at 5 to 7% fabrication on document Q&A. Not on questions from memory. Not on abstract reasoning. On questions where the answer is sitting right there in the document in front of it.

The median across all 35 models tested was around 25%.

One in four answers fabricated, even with the source material provided.

Then they tested what happens when you extend the context window. Every company selling 128K and 200K context as the hallucination solution needs to read this part carefully.

At 200K context length, every single model in the study exceeded 10% hallucination. The rate nearly tripled compared to optimal shorter contexts.

The longer the window people want, the worse the fabrication gets. The exact feature being sold as the fix is making the problem significantly worse.

There is one more finding that does not get talked about enough.

Grounding skill and anti-fabrication skill are completely separate capabilities in these models.

A model that is excellent at finding relevant information in a document is not necessarily good at avoiding making things up. They are measuring two different things that do not reliably correlate. You cannot assume a model that retrieves well also fabricates less.

172 billion tokens. 35 models. The conclusion is the same across all of them.

Handing an LLM the actual document does not solve hallucination. It just changes the shape of it.

261

478K

anders_tobiason retweeted

Rohan Paul

@rohanpaul_ai

3 months ago

New Harvard Business Review research reveals that excessive interaction with AI is causing a specific type of mental exhaustion ( or AI brain fry), which is particularly hitting high performers who use the tech to push past their normal limits. A survey of 1,500 workers reveals that AI is intensifying workloads rather than reducing them, leading to a new form of mental fog. While AI is generally supposed to lighten the load, it often forces users into constant task-switching and intense oversight that actually clutters the mind. This mental static happens because you aren't just doing your job anymore; you are managing multiple digital agents and double-checking their work, which creates a massive cognitive burden. The study found that 14% of full-time workers already feel this fog, with the highest impact seen in technical fields like software development, IT, and finance. High oversight is the biggest culprit, as supervising multiple AI outputs leads to a 12% increase in mental fatigue and a 33% jump in decision fatigue. This isn't just a personal health issue; it directly impacts companies because exhausted employees are 10% more likely to quit. For massive firms worth many B, this decision paralysis can lead to millions of dollars in lost value due to poor choices or total inaction. Essentially, we are working harder to manage our tools than we are to solve the actual problems they were meant to fix. --- hbr .org/2026/03/when-using-ai-leads-to-brain-fry

rohanpaul_ai's tweet photo. New Harvard Business Review research reveals that excessive interaction with AI is causing a specific type of mental exhaustion ( or AI brain fry), which is particularly hitting high performers who use the tech to push past their normal limits.

A survey of 1,500 workers reveals that AI is intensifying workloads rather than reducing them, leading to a new form of mental fog.

While AI is generally supposed to lighten the load, it often forces users into constant task-switching and intense oversight that actually clutters the mind.

This mental static happens because you aren't just doing your job anymore; you are managing multiple digital agents and double-checking their work, which creates a massive cognitive burden.

The study found that 14% of full-time workers already feel this fog, with the highest impact seen in technical fields like software development, IT, and finance.

High oversight is the biggest culprit, as supervising multiple AI outputs leads to a 12% increase in mental fatigue and a 33% jump in decision fatigue.

This isn't just a personal health issue; it directly impacts companies because exhausted employees are 10% more likely to quit.

For massive firms worth many B, this decision paralysis can lead to millions of dollars in lost value due to poor choices or total inaction.

Essentially, we are working harder to manage our tools than we are to solve the actual problems they were meant to fix.

---

hbr .org/2026/03/when-using-ai-leads-to-brain-fry

145

365

571K

anders_tobiason retweeted

Mehdi Hasan

@mehdirhasan

3 months ago

We're so screwed as a society.

326

61K

10K

anders_tobiason retweeted

Ryan Hart

@thisdudelikesAI

3 months ago

🚨 Holy shit… Stanford just published the most uncomfortable paper on LLM reasoning I’ve read in a long time. This isn’t a flashy new model or a leaderboard win. It’s a systematic teardown of how and why large language models keep failing at reasoning even when benchmarks say they’re doing great. The paper does one very smart thing upfront: it introduces a clean taxonomy instead of more anecdotes. The authors split reasoning into non-embodied and embodied. Non-embodied reasoning is what most benchmarks test and it’s further divided into informal reasoning (intuition, social judgment, commonsense heuristics) and formal reasoning (logic, math, code, symbolic manipulation). Embodied reasoning is where models must reason about the physical world, space, causality, and action under real constraints. Across all three, the same failure patterns keep showing up. > First are fundamental failures baked into current architectures. Models generate answers that look coherent but collapse under light logical pressure. They shortcut, pattern-match, or hallucinate steps instead of executing a consistent reasoning process. > Second are application-specific failures. A model that looks strong on math benchmarks can quietly fall apart in scientific reasoning, planning, or multi-step decision making. Performance does not transfer nearly as well as leaderboards imply. > Third are robustness failures. Tiny changes in wording, ordering, or context can flip an answer entirely. The reasoning wasn’t stable to begin with; it just happened to work for that phrasing. One of the most disturbing findings is how often models produce unfaithful reasoning. They give the correct final answer while providing explanations that are logically wrong, incomplete, or fabricated. This is worse than being wrong, because it trains users to trust explanations that don’t correspond to the actual decision process. Embodied reasoning is where things really fall apart. LLMs systematically fail at physical commonsense, spatial reasoning, and basic physics because they have no grounded experience. Even in text-only settings, as soon as a task implicitly depends on real-world dynamics, failures become predictable and repeatable. The authors don’t just criticize. They outline mitigation paths: inference-time scaling, analogical memory, external verification, and evaluations that deliberately inject known failure cases instead of optimizing for leaderboard performance. But they’re very clear that none of these are silver bullets yet. The takeaway isn’t that LLMs can’t reason. It’s more uncomfortable than that. LLMs reason just enough to sound convincing, but not enough to be reliable. And unless we start measuring how models fail not just how often they succeed we’ll keep deploying systems that pass benchmarks, fail silently in production, and explain themselves with total confidence while doing the wrong thing. That’s the real warning shot in this paper. Paper: Large Language Model Reasoning Failures

thisdudelikesAI's tweet photo. 🚨 Holy shit… Stanford just published the most uncomfortable paper on LLM reasoning I’ve read in a long time.

This isn’t a flashy new model or a leaderboard win. It’s a systematic teardown of how and why large language models keep failing at reasoning even when benchmarks say they’re doing great.

The paper does one very smart thing upfront: it introduces a clean taxonomy instead of more anecdotes. The authors split reasoning into non-embodied and embodied.

Non-embodied reasoning is what most benchmarks test and it’s further divided into informal reasoning (intuition, social judgment, commonsense heuristics) and formal reasoning (logic, math, code, symbolic manipulation).

Embodied reasoning is where models must reason about the physical world, space, causality, and action under real constraints.

Across all three, the same failure patterns keep showing up.

> First are fundamental failures baked into current architectures. Models generate answers that look coherent but collapse under light logical pressure. They shortcut, pattern-match, or hallucinate steps instead of executing a consistent reasoning process.

> Second are application-specific failures. A model that looks strong on math benchmarks can quietly fall apart in scientific reasoning, planning, or multi-step decision making. Performance does not transfer nearly as well as leaderboards imply.

> Third are robustness failures. Tiny changes in wording, ordering, or context can flip an answer entirely. The reasoning wasn’t stable to begin with; it just happened to work for that phrasing.

One of the most disturbing findings is how often models produce unfaithful reasoning. They give the correct final answer while providing explanations that are logically wrong, incomplete, or fabricated.

This is worse than being wrong, because it trains users to trust explanations that don’t correspond to the actual decision process.

Embodied reasoning is where things really fall apart. LLMs systematically fail at physical commonsense, spatial reasoning, and basic physics because they have no grounded experience.

Even in text-only settings, as soon as a task implicitly depends on real-world dynamics, failures become predictable and repeatable.

The authors don’t just criticize. They outline mitigation paths: inference-time scaling, analogical memory, external verification, and evaluations that deliberately inject known failure cases instead of optimizing for leaderboard performance.

But they’re very clear that none of these are silver bullets yet.

The takeaway isn’t that LLMs can’t reason.

It’s more uncomfortable than that.

LLMs reason just enough to sound convincing, but not enough to be reliable.

And unless we start measuring how models fail not just how often they succeed we’ll keep deploying systems that pass benchmarks, fail silently in production, and explain themselves with total confidence while doing the wrong thing.

That’s the real warning shot in this paper.

Paper: Large Language Model Reasoning Failures

297

307

25K

anders_tobiason retweeted

Ihtesham Ali

@ihtesham2005

3 months ago

🚨 Stanford researchers just exposed a weird side effect of AI that almost nobody is talking about. The paper is called “Artificial Hivemind.” And the core finding is unsettling. As language models get better, they also start sounding more and more the same. Not just within a single model. Across different models. Researchers built a dataset called INFINITY-CHAT with 26,000 real open-ended questions things like creative writing, brainstorming, opinions, and advice. Questions where there isn’t a single correct answer. In theory, these prompts should produce huge diversity. But the opposite happened. Two patterns showed up: 1) Intra-model repetition The same model keeps producing very similar answers across runs. 2) Inter-model homogeneity Completely different models generate strikingly similar responses. In other words: Instead of thousands of unique perspectives… We’re getting the same few ideas recycled over and over. The authors call this the “Artificial Hivemind.” It happens because most frontier models are trained on similar data, optimized with similar reward models, and aligned using similar human feedback. So even when you ask something open-ended like: • “Write a poem about time” • “Suggest creative startup ideas” • “Give life advice” Many models converge toward the same phrasing, metaphors, and reasoning patterns. The scary implication isn’t about AI quality. It’s about culture. If billions of people rely on the same systems for ideas, writing, brainstorming, and thinking… AI might slowly compress the diversity of human thought. Not because it’s trying to. But because the models themselves are drifting toward the same answers. That’s the real risk the paper highlights. Not that AI becomes smarter than humans. But that everyone starts thinking like the same machine.

ihtesham2005's tweet photo. 🚨 Stanford researchers just exposed a weird side effect of AI that almost nobody is talking about.

The paper is called “Artificial Hivemind.” And the core finding is unsettling.

As language models get better, they also start sounding more and more the same.

Not just within a single model. Across different models.

Researchers built a dataset called INFINITY-CHAT with 26,000 real open-ended questions things like creative writing, brainstorming, opinions, and advice. Questions where there isn’t a single correct answer.

In theory, these prompts should produce huge diversity.

But the opposite happened.

Two patterns showed up:

1) Intra-model repetition
The same model keeps producing very similar answers across runs.

2) Inter-model homogeneity
Completely different models generate strikingly similar responses.

In other words:

Instead of thousands of unique perspectives…

We’re getting the same few ideas recycled over and over.

The authors call this the “Artificial Hivemind.”

It happens because most frontier models are trained on similar data, optimized with similar reward models, and aligned using similar human feedback.

So even when you ask something open-ended like:

• “Write a poem about time”
• “Suggest creative startup ideas”
• “Give life advice”

Many models converge toward the same phrasing, metaphors, and reasoning patterns.

The scary implication isn’t about AI quality.

It’s about culture.

If billions of people rely on the same systems for ideas, writing, brainstorming, and thinking…

AI might slowly compress the diversity of human thought.

Not because it’s trying to.

But because the models themselves are drifting toward the same answers.

That’s the real risk the paper highlights.

Not that AI becomes smarter than humans.

But that everyone starts thinking like the same machine.

406

389K

Anders Tobiason

@anders_tobiason

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users