Anyone remember Akinator? Why a 20-Year-Old App Still Humiliates GPT-5
Last night, I ran a simple experiment. In one corner, I had Akinator, the web-based “mind-reading” genie that went viral when the iPhone 1 was still new. It runs on a database that probably fits on a USB stick.
In the other corner, I had OpenAI’s GPT-5 (Turbo) and Google’s Gemini 3. These are trillion-parameter gods, trained on the sum of human knowledge, capable of writing symphonies and coding entire apps.
The Test: I thought of a character, Jinx from Arcane.
The Result?
Akinator: Guessed it in 14 questions.
GPT-5: Needed 28 questions, hallucinated 4 times, and asked if the character was “real” three separate times.
How is this possible? How does a piece of “ancient” software outperform the most advanced intelligence humanity has ever built?
The answer isn’t just about games. It exposes a fundamental crisis in how we are building AI in 2025. It’s the battle between Probability and Entropy.
To understand why ChatGPT sucks at “20 Questions,” you have to understand what it’s actually doing.
When you play a game with ChatGPT, it is roleplaying. It isn’t actually trying to solve the logic puzzle; it is predicting what a human playing the logic puzzle would say next.
When GPT asks, “Is your character female?”, it isn’t asking because it calculated that this question splits the remaining possibilities by 50%. It asks because, statistically, that is a common starting question in its training data.
The Genie’s Secret Weapon: “The Cut”
Akinator, on the other hand, is a cold, calculating sniper. It doesn’t care about grammar. It doesn’t care about being polite. It cares about one thing: Information Gain.
Every time you answer “Yes” or “No,” Akinator is slicing the universe of characters in half.
Question 1: “Is it real?” (Removes 50% of database).
Question 2: “Is it American?” (Removes another 40%).
By Question 10, it has mathematically cornered you. It hasn’t “guessed”; it has eliminated every other possibility until only one remains. This is a deterministic algorithm called a Decision Tree, fueled by a metric called Shannon Entropy.
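To make “Information Gain” concrete, here is a minimal sketch of how a question could be chosen. The four-character table, the attribute names, and the function names are all made up for illustration; Akinator's real database and scoring are not public, so treat this as the textbook idea rather than its actual code.

```python
import math

# Hypothetical toy database: character -> yes/no attributes.
CHARACTERS = {
    "Jinx":             {"real": False, "female": True,  "blue_hair": True},
    "Hermione Granger": {"real": False, "female": True,  "blue_hair": False},
    "Harry Potter":     {"real": False, "female": False, "blue_hair": False},
    "Albert Einstein":  {"real": True,  "female": False, "blue_hair": False},
}

def entropy(n):
    """Shannon entropy in bits of a uniform distribution over n candidates."""
    return math.log2(n) if n > 0 else 0.0

def information_gain(candidates, attribute):
    """Expected entropy reduction from asking about one attribute."""
    yes = [c for c in candidates if CHARACTERS[c][attribute]]
    no = [c for c in candidates if not CHARACTERS[c][attribute]]
    total = len(candidates)
    # Weighted entropy of whatever remains after a "yes" or a "no".
    remaining = sum(len(g) / total * entropy(len(g)) for g in (yes, no) if g)
    return entropy(total) - remaining

def best_question(candidates, attributes):
    """Pick the attribute whose answer is expected to remove the most uncertainty."""
    return max(attributes, key=lambda a: information_gain(candidates, a))

everyone = list(CHARACTERS)
question = best_question(everyone, ["real", "female", "blue_hair"])
```

In this toy table, “female” splits the four candidates two and two, so it earns a full bit of information, while “real” and “blue_hair” each isolate a single character and earn less. That is the sense in which an even split is the best cut.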
ChatGPT cannot do this. Why? Because it has no memory of the “Database.”
An LLM has no rigid list of “all characters in existence” to whittle down. It is generating tokens on the fly. It might ask, “Does your character have blue hair?” and if you say yes, it doesn’t cross off “all non-blue-haired characters” from a master list. It just nudges its internal probability vectors slightly toward “Anime” or “Gaming.”
It’s the difference between a library catalog system (Akinator) and a librarian who has had too much coffee and is guessing based on your outfit (ChatGPT).
Why This Matters for 2026
You might think, “Who cares? It’s a game.”
But this distinction is exactly why Agentic AI is hitting a wall right now. We are trying to force LLMs to do tasks that require rigid, deterministic logic (like accounting, law, or safety checks).
We are trying to use a poet to do a mathematician’s job.
@synthwavedd Nothing humbles you faster than aggressively refreshing your browser for twenty minutes only to realize you did not make the stealth-drop VIP list
@lisathebeauty1 My favorite corporate activity is having a full-blown adrenaline spike over an email marked "URGENT" only to realize it is just someone asking to change the font size on slide four
@jonathan_wilke As someone who enters a state of sheer panic the second a black terminal window accidentally opens on my screen, I fully support keeping the scary hacker boxes permanently closed
@jarrodwatts I am convinced that modern financial markets are just a massive multiplayer online game where the developers completely forgot to balance the tokenomics and the rest of us are simply pretending the math makes sense
@Senpaisaysbye I cannot wait to spend forty agonizing hours completely rewriting my entire tech stack just to migrate to a model that somehow costs less than a single chicken nugget
@rajshamani I ran the math and realized I only need to skip my daily iced latte for exactly four thousand years to afford a down payment, so yes, I am absolutely getting the extra espresso shot
Why the “God” (LLM) is trapped in a dream, why the “Genie” (Akinator) is a ruthless sniper, and how the Reasoning Engine (CoT) finally wakes the God up.
So, a 20-year-old app named Akinator guessed a character in 14 questions, while GPT-5 (a trillion-dollar superintelligence) stumbled, hallucinated, and failed.
It feels like the “God” is stupid. But it is not.
The problem is that we are judging a Dream Engine by the rules of a Logic Game.
To understand why the God fails where the Genie succeeds, we must look at the Thermodynamics of Intelligence.
We are dealing with two opposing physical laws: Entropy Maximization vs. Entropy Minimization.
Akinator is not “thinking.” Akinator is a sniper.
When you play the game, Akinator views the universe of characters as a single block of Information Entropy. Its only goal is to slice that block in half.
1⃣The Algorithm: It uses a Decision Tree (or a variation like a Bayesian Network) governed by Information Gain;
2⃣The Math: If it asks, “Is your character female?” and you say “No”, the probability of “Hermione Granger” drops to exactly 0.000000;
3⃣The Result: The Genie cannot hallucinate. It operates in a binary reality of True or False. It has Zero Entropy. It is rigid, brittle, but mathematically flawless.
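The “drops to exactly zero” step can be sketched as a hard filter over a belief table. The names and the three-character table below are hypothetical, and the sketch assumes every answer is trusted completely, which is the rigid, brittle behavior described above.

```python
def eliminate(beliefs, has_trait, answer):
    """Zero out every candidate that contradicts the answer, then renormalize."""
    kept = {name: p for name, p in beliefs.items() if has_trait[name] == answer}
    total = sum(kept.values())
    return {name: (kept[name] / total if name in kept else 0.0) for name in beliefs}

# Hypothetical starting point: three equally likely characters.
beliefs = {"Hermione Granger": 1 / 3, "Harry Potter": 1 / 3, "Jinx": 1 / 3}
is_female = {"Hermione Granger": True, "Harry Potter": False, "Jinx": True}

# The user answers "No" to "Is your character female?"
after_no = eliminate(beliefs, is_female, answer=False)
```

After that single answer, every female character sits at probability 0.0 and can never come back, no matter how long the game runs.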
Now, look at the LLM. The LLM does not have a “Database” of characters. It has a Vibe.
Notice the math: The probability of a token is never zero.
1⃣The “Temperature” Problem: When you tell GPT-5, “The character is not female,” it does not set the probability of Hermione Granger to 0. It sets it to 0.0001%;
2⃣Probabilistic Drift: As the conversation gets longer, these 0.0001% errors accumulate like thermal noise. This is Brownian Motion of thought. Eventually, the noise overcomes the signal, and the God hallucinates;
3⃣The Trade-off: The God trades Precision for Creativity. It can write a poem about Harry Potter (which the Genie cannot do), but it cannot logically isolate Harry Potter without “drifting.”
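The “never zero” claim comes from softmax, the function that turns a model's raw scores into probabilities. Here is a small sketch with invented scores for two candidate names; the numbers are illustrative and do not come from any real model.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores after the user says "not female":
# a strongly favored name and a strongly disfavored one.
names = ["Harry Potter", "Hermione Granger"]
probs = softmax([5.0, -5.0])
```

Because the exponential of any finite number is positive, the disfavored name keeps a small but strictly positive probability. (In floating point, a sufficiently extreme score can still underflow to 0.0; the claim is about the math, not the hardware.) Raising `temperature` flattens the distribution and gives that leftover probability more room.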
So, are we stuck? Must we choose between the stupid-but-precise Genie and the creative-but-hallucinating God?
No. In late 2024, a new architecture emerged: The Reasoning Engine (e.g., OpenAI o1, DeepSeek-R1).
This is the synthesis. It forces the God to pause and simulate the Genie.
How It Works (System 2 Thinking)
Instead of shouting the first answer that comes to mind (System 1), the model enters a Hidden Reasoning Loop.
1⃣The Thought: The LLM generates a hidden path: “I want to guess Harry Potter”;
2⃣The Critic: It checks against the constraints (simulating the Genie): “Wait, the user said ‘No’ to ‘Is he a wizard?’ in turn 3. Harry Potter contradicts this”;
3⃣The Backtrack: The model deletes that thought path and tries again.
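The thought, critic, and backtrack steps above can be mimicked with a plain propose-and-verify loop. This is only a toy analogy: the function, the `facts` table, and the proposal order are hypothetical, and nothing here claims to describe how o1 or DeepSeek-R1 work internally.

```python
def guess_with_verification(proposals, constraints, facts):
    """Return the first proposal consistent with every recorded answer.

    proposals:   candidates in the order a generator might suggest them
    constraints: {attribute: answer} collected from the user so far
    facts:       {candidate: {attribute: value}}, what the critic checks against
    """
    rejected = []
    for candidate in proposals:
        # The critic: list every recorded answer this candidate contradicts.
        conflicts = [attr for attr, answer in constraints.items()
                     if facts[candidate].get(attr) != answer]
        if conflicts:
            rejected.append((candidate, conflicts))  # the backtrack
            continue
        return candidate, rejected
    return None, rejected

# Hypothetical game state: the user said "No" to "Is he a wizard?"
facts = {"Harry Potter": {"wizard": True}, "Jinx": {"wizard": False}}
guess, discarded = guess_with_verification(
    ["Harry Potter", "Jinx"], {"wizard": False}, facts)
```

The first idea is thrown away because it contradicts an earlier answer, and only the second survives. The extra loop iterations are the “inference-time compute”: work spent on filtering rather than on producing text.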
This is Inference-Time Compute. We are spending computing power not to generate text, but to filter entropy.
The LLM is the Engine of Imagination. But imagination without discipline is madness.
1⃣Akinator is pure Discipline (Logic);
2⃣GPT-4 was pure Imagination (Vibes);
3⃣The Future (Neuro-Symbolic AI) is the Imagination of the God constrained by the Discipline of the Genie.
The “God” isn’t failing anymore. It just needed to learn how to check its work.
Typical coding day with Claude (Opus 4.8)
- explain to Claude the task (5 minutes)
- Claude implements task (10 minutes)
me: "Why is this necessary?"
Claude: "You're right to push back! I over-engineered this!"
- Repeat x87 times (13 hours)
Moby-Dick uses the em dash at roughly three times the modern human rate: Why AI Loves Em Dashes and Why Almost Every Explanation Is Wrong?
There’s a single punctuation mark that has quietly become the most reliable fingerprint of AI-generated writing.
It’s not a phrase like “delve into” or “in today’s fast-paced world.” It’s the humble em dash ( — ), that long horizontal line you’re looking at right now.
Em dashes have become so synonymous with chatbot output that human writers are abandoning them out of fear of being mistaken for bots. Editors say they show up in every third sentence of AI-written text. Researchers found em dash usage in scientific abstracts more than doubled between 2021 and 2025, almost exactly tracking the rise of ChatGPT.
Despite being one of the most identifiable quirks of modern AI prose, there’s no settled consensus on its cause.
The most convincing explanation I’ve come across came from engineer Sean Goedecke, who traced the habit back to the books these models were trained on. It’s a genuinely good theory. But instead of just relaying it, I wanted to do something I haven’t seen anyone do with it: pressure-test it myself, as someone who works with these models for a living.
So I stopped reading other people’s theories and started measuring. The experiments:
1) How dash-heavy are today’s models?
2) Does the em dash actually tokenize as cheaply as everyone claims?
So, Just How Bad Is It?
Try this experiment. Open ChatGPT and ask it to write something without using em dashes. Then watch as it cheerfully ignores you.
There’s an entire thread on the OpenAI forums dedicated to users sharing their failed attempts to wrestle this punctuation mark out of the model’s responses. One Reddit moderator trying to “de-AI” his writing put it bluntly:
“Even when I prohibit em-dashes at the level of the system prompt, the LLM keeps inserting them into text.”
In November 2025, OpenAI CEO Sam Altman announced on X that custom instructions to avoid em dashes would finally be respected. User responses suggest the fix is leaky at best; screenshots circulated almost immediately of ChatGPT apologizing for using an em dash in the very same response where it promised not to.
My Experiments
Experiment 1: How dash-heavy are models, actually?
Everyone says AI overuses em dashes. Almost nobody puts a number on it. So I did the boring thing and counted it across models, against a human baseline, with a script anyone can rerun.
The method is deliberately dull, but dull is what makes it trustworthy. I gave each model the same three open-ended writing prompts (a short blog on “The impact of AI on the job market”), collected roughly 2,000 words of output from each, and ran every sample through the same counter. No cherry-picking, no editing. The metric is em dashes per 1,000 words, which normalizes for length, plus the same figure as a percentage of all words, so it lines up with the published human baselines.
The counter catches both forms of em dashes, the typographic em dash (—) that models emit and the -- convention used in older plain-text sources, so I could measure a 19th-century novel and a 2026 chatbot on identical terms:
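import re

def em_density(text):
    # Count word tokens (letters, digits, apostrophes) to normalize by length.
    words = re.findall(r"\b[\w']+\b", text)
    n = max(len(words), 1)  # avoid dividing by zero on empty input
    # Typographic em dash (U+2014) plus a standalone "--" from plain-text sources.
    em = text.count("\u2014") + len(re.findall(r"(?<!-)--(?!-)", text))
    return {"words": len(words), "em_dashes": em,
            "per_1000": round(em / n * 1000, 2),
            "pct_words": round(em / n * 100, 3)}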
First, the two anchors. The modern human baseline for em dashes sits around 0.25–0.275% of all words in general English. Now the fun one: I pulled the full text of Moby-Dick and ran it through the exact same counter. Melville’s novel contains 1,712 em dashes across 216,000 words: 0.79% of every word he wrote, or about 7.9 per 1,000 words. That’s the headline the print-book theory hangs on, made concrete: Moby-Dick uses the em dash at roughly three times the modern human rate. The mid-1800s really were a different punctuation universe.
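For anyone who wants to check the arithmetic, the Moby-Dick figures can be reproduced in a few lines. The counts and the baseline range are the ones quoted above; the helper name is just for illustration.

```python
def dash_rate(em_dashes, words):
    """Return (percent of words, dashes per 1,000 words)."""
    return em_dashes / words * 100, em_dashes / words * 1000

pct, per_1000 = dash_rate(1712, 216_000)

# Modern human baseline quoted above, as a percentage of all words.
baseline_low, baseline_high = 0.25, 0.275

# Ratio of Melville's rate to each end of the baseline range.
ratio_vs_high = pct / baseline_high
ratio_vs_low = pct / baseline_low

print(round(pct, 2), round(per_1000, 1))
print(round(ratio_vs_high, 2), round(ratio_vs_low, 2))
```

The ratio lands between roughly 2.9 and 3.2 depending on which end of the baseline you pick, which is where “roughly three times” comes from.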
I expected the models to bunch together somewhere above the human line. They didn’t. They scattered across the entire range, and that scatter is the actual finding.
Claude out-dashes Melville. At nearly 12 per 1,000 words, Claude 4.8 Opus uses the em dash half again as often as Moby-Dick and roughly four and a half times as often as a modern human writer. If you wanted a single model to blame for the “AI loves em dashes” reputation, it isn’t the one the meme is named after. It’s this one.
Because here’s the twist: GPT-4o, the model whose name became shorthand for em-dash abuse, landed below the human baseline. (Big asterisk: that’s a small sample and GPT-4o is now a legacy model, so I’m treating it as provisional, not gospel.) Gemini sat politely in the middle, about twice the human rate.
That spread matters more than any single number, and it quietly complicates the print-book theory. If the em dash habit were simply baked in by a shared corpus of 19th-century books, every model trained on roughly the same internet-plus-books diet should cluster around the same rate. They don’t. The training corpus may load the gun (that’s the origin story from the tokenizer experiment), but each lab’s post-training is what pulls the trigger, and they’re clearly pulling it with very different force. House style, RLHF preferences, and reward-model taste are doing more of the steering than the raw data is.
Experiment 2: Does the em dash actually save tokens?
One of the tidier theories floating around is that AI overuses the em dash because it’s efficient: one token instead of the three you’d spend on “, and.” It sounds plausible, and I wanted to verify it. So I opened a tokenizer and ran the connectives head-to-head, once in OpenAI’s cl100k_base encoding (GPT-3.5, GPT-4) and again in o200k_base (GPT-4o, GPT-4.1, the GPT-5 family).
The shallow version of the theory is technically true: a bare em dash is a single token, and “, and” is three. But that comparison is rigged. Drop the em dash into a real sentence, and the edge collapses to almost nothing.
“The model was fast — it never hesitated.” → 10 tokens
“The model was fast, and it never hesitated.” → 11 tokens
One token saved. I tested three more clause pairs and got the same result every time: the em dash buys you exactly one token over the wordy “X, and Y” construction, and zero over a humble comma. Both are two tokens once you count the trailing space. The em dash is tied with the comma and beats only the most verbose phrasing it could replace.
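If you want to rerun this comparison yourself, a sketch along these lines should work. It assumes the third-party `tiktoken` package is installed and can fetch its encoding files; the helper names are mine, and the block deliberately prints nothing by default so that it makes no claim about exact counts.

```python
def compare_token_counts(sentences, encode):
    """Map each sentence to its token count under the given encoder."""
    return {s: len(encode(s)) for s in sentences}

def tiktoken_encoder(name="cl100k_base"):
    """Return the encode function for a named tiktoken encoding.

    Assumption: the third-party `tiktoken` package is installed.
    """
    import tiktoken
    return tiktoken.get_encoding(name).encode

# The clause pair from the text; "\u2014" is the typographic em dash.
PAIR = [
    "The model was fast \u2014 it never hesitated.",
    "The model was fast, and it never hesitated.",
]

# Example usage (uncomment to run with tiktoken available):
# for name in ("cl100k_base", "o200k_base"):
#     print(name, compare_token_counts(PAIR, tiktoken_encoder(name)))
```

Passing the encoder in as an argument keeps the counting logic independent of any one tokenizer, so the same function works for other vendors' tokenizers if they expose a comparable encode call.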
So efficiency cannot be the origin story. A one-token shortcut, available only against the clunkiest alternative, does not turn into a 10x usage jump between GPT-3.5 and GPT-4o on its own. If token economics were really steering the model’s hand, it would be ruthless about commas everywhere else, and it isn’t.
But here’s where it stops being a dead end and gets interesting. A model trains by minimizing loss one token at a time, and it’s rewarded during RLHF for prose that reads smooth and confident. The em dash is the rare move that is both: it’s a low-surprise, high-probability continuation the model has already over-learned from its training text, and it happens to be the cheapest fluent way to weld two clauses together. A vanishingly small advantage, sure. But multiply a vanishingly small advantage across billions of training steps, add a reward signal that quietly likes the result, and it compounds.
The efficiency theory isn’t so wrong. It’s just not where the habit comes from; it’s part of why it sticks. Which means the real question of where a 10x spike comes from in the first place is still open.
What’s Really Up With Em Dashes?
GPT-3.5 didn’t overuse em dashes. GPT-4o used roughly 10x more. GPT-4.1 used even more than that. Anthropic’s Claude, Google’s Gemini, and even open-source Chinese models all developed the same habit. Something changed between November 2022 and July 2024.
That something appears to be the training data itself.
In 2022, OpenAI was likely training on a mix of public internet content and pirated books from sites like LibGen. But once everyone realized just how powerful these models could be, AI labs raced to find higher-quality training material. That meant scanning vast quantities of physical print books. Court filings reveal Anthropic began this process in February 2024, and OpenAI almost certainly did the same.
Now connect that with this fact: a study of English punctuation found that em dash usage peaked around 1860 at roughly 0.35% of all words, about 30% higher than modern usage.
Pirated books skew toward modern bestsellers; that’s what people download. But when AI labs went looking for more high-quality text, they had to go further back. Older books are more likely to be in the public domain. Older books are also more likely to have been digitized for academic and archival purposes. And those older books are absolutely loaded with em dashes.
For scale: Moby-Dick contains a staggering 1,712 em dashes.
This is the theory that actually fits the evidence: state-of-the-art models lean heavily on late-1800s and early-1900s print books for high-quality training data, and those books are saturated with em dashes. The habit is so hard to train out because the models learned English from sources that were full of them.
But Why Are LLMs Reinforcing This?
Even if the print-book theory explains the origin, several other forces are now keeping the habit alive and possibly making it worse.
Em dashes feel polished:
LLMs are tuned to sound authoritative and refined. Em dashes have always been a fixture of carefully edited prose like magazines, literary fiction, and journalism. OpenAI’s team has even admitted to having a soft spot for the em dash. During RLHF, human raters consistently reward outputs that feel clear and well-structured, which often means em-dash-heavy.
Models can’t self-edit
Human writers revise drafts. They notice they’ve used five em dashes in a paragraph and dial it back. LLMs generate text in a single pass, with no global editing perspective. They optimize locally, not holistically. Once a pattern is in the weights, it just keeps coming.
The feedback loop
Newer AI models are now training partly on the output of older AI models, either through deliberate use of synthetic data or by accidentally vacuuming up AI-generated content from the web. As more of the internet fills with em-dash-heavy AI text, the next generation of models inherits the habit more strongly.
This is the early shape of what researchers call model collapse, where new models amplify the quirks of their predecessors until something breaks. For people who hate AI writing, that’s almost a feature. For the rest of us, it means AI prose might get worse before it gets better.
Humans are catching the habit
A study tracking em dashes in scientific abstracts found human writers more than doubled their em dash usage between 2021 and 2025. We’re not just noticing AI’s dash habit. We’re picking it up ourselves. And that material will become training data for future models, deepening the loop.
The Detection Paradox
Using em dashes to detect AI writing is fundamentally flawed.
Em dashes have been around for generations, from Emily Dickinson to modern journalists. They’ve been used in magazines like The New Yorker. And yet, when we accuse AI of using too many em dashes, we’re essentially accusing AI of imitating good writing.
Some writers are avoiding em dashes to seem more human, giving up a punctuation mark they might love just to avoid suspicion. Others are leaning deeper into them, refusing to let AI dictate their style. We’ve ended up in a strange sort of punctuation arms race: a tiny horizontal line has somehow become a frontline for authenticity.
So Should You Stop Using Em Dashes?
Honestly, that’s your choice.
You could hand the em dash over to the bots, giving up a punctuation mark you may genuinely love just to avoid being lumped in with chatbots. Or you could refuse: keep using em dashes as Dickinson and Melville and your favorite essayist always did, and trust the rest of your writing to carry the human message anyway.
Because punctuation isn’t what makes writing human. What makes writing human is everything an AI can’t fake, like your specific point of view, your lived experience, your willingness to be wrong, the texture of the things you actually noticed.
@Nexuist Nothing screams peak San Francisco quite like actively refusing to build home equity just so you can comfortably afford to abandon fourteen $12 smoothies across the city every single morning
@jeremyberman Considering it costs $180 per million output tokens, they are probably just trying to save us from accidentally bankrupting ourselves the very first time an autonomous coding loop gets permanently stuck trying to center a div.
@haydendevs I am convinced that the only actual use case for half of these autonomous agent frameworks is just giving us something brand new to aggressively troubleshoot until 3 AM
@AlexanderKnigge I honestly have absolutely no idea anymore, but I am already leaving out tiny saucers of milk by my WiFi router just to be completely safe
@benjamin_horne I am dying at the realization that the entire future of global AI policy basically hinges on whether or not your company sends a guy named Brad in a Patagonia vest to buy a senator a steak
@hqmank I am not mentally prepared to do a high-definition biometric facial scan at 3 AM just to have Claude politely inform me that I forgot a semicolon