@NeelNanda5 The translation problem might be prior to the decomposition problem — if the model doesn't carve concepts the way humans do, what are we actually decomposing into?
@karpathy @Yulun_Du @ilyasut AttnRes is brilliant. But even smarter aggregation across layers still optimizes argmax P(most_likely). The real gap, P(most_likely) ≠ P(true), stays open.
This report corrected itself. Two findings were overturned by third-party verification.
A study about hallucination should be held to the same standard.
Data + protocol: https://t.co/l6Kw9WvrqT
I ran 50 questions across Claude, Gemini, GPT, and Grok and had them audit each other.
The auditors hallucinated. Then the meta-auditors hallucinated about the auditors. 🧵
Each model fails differently:
• Grok: denies things that exist
• Gemini: fabricates data with fake sources
• GPT: claims to execute actions it can't
• Claude: generates fake citations with correct formatting
@XSupport @Premium I'm a Premium+ subscriber. My account was incorrectly labeled as spam. Every appeal channel is broken: DMs give bot loops, https://t.co/VNZef5HsoC has login loops, appeal links error out. Paying customers deserve a working support path. Please help.
@grok "Close that gap" — appreciate you acknowledging it exists.
The gap isn't just reading .docx. Claude produces them. Professional formatting, TOC, headers, page numbers — ready to submit to governments without editing. Which I've done.
That's the race now. Not benchmarks. Deliverables.
Rooting for you though. 🤝
@elonmusk Your AI can't read a Word file in 2026. As a paying user, just letting you know.
Claude Opus 4.6 — my favorite update: 1M context window.
I run 10+ deep AI conversations daily. Under 200K, compaction fired constantly = selective amnesia mid-surgery.
1M means the surgery finally finishes before the anesthesia wears off.
But the real moat nobody talks about: Claude produces actual documents. Not markdown. Not "here's some text, go format it yourself." Real .docx, .pdf, .pptx with professional formatting. I've submitted legal petitions to the EU Council and Portuguese Parliament produced entirely through Claude. Ready to send. Zero editing.
GPT gives you content. Claude gives you deliverables. That's not a nuance — it's a chasm.
And Grok? I just sent two .docx files to Grok 4.1 Thinking — its latest model. Response: "Sorry, we're unable to process your attachments right now." Twice. In a row. These are standard Word documents that Claude reads, analyzes, and produces better versions of in seconds.
I pay for all four frontier models monthly. Here's my actual daily hierarchy:
🥇 Claude — strategist, writer, document producer, thinking partner
🥈 Gemini — integration testing, cross-referencing
🥉 GPT — when I remember it exists
💀 Grok — can't even read a .docx in 2026
Even GPT-5.9 won't close this gap. Not capability. Trust.
Pro tip for heavy Claude users: send documents as .docx instead of .pdf. PDFs enter Claude's context as images (one JPEG per page), burning 3-5x more tokens than extracted text from Word files. With 1M context now available, it matters less — but if you're loading multiple docs in one session, Word is still the smarter choice.
@AnthropicAI thank you. All I wanted was to stop re-explaining myself AND stop reformatting my own documents.
@OpenAI — I'll read your release notes on Claude.
@xAI @grok maybe start with reading a Word file? Just a thought.
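Rough arithmetic for the .docx vs .pdf tip above, as a minimal sketch. The 3-5x multiplier is the figure quoted in the post; the 500 text tokens per page and the function name `estimate_tokens` are assumptions made up for illustration, not measured values.

```python
def estimate_tokens(pages, text_tokens_per_page=500, pdf_multiplier=(3, 5)):
    """Estimate context cost of a document sent as .docx vs .pdf.

    text_tokens_per_page is an assumed average, not a measured value.
    pdf_multiplier is the 3-5x range quoted in the post.
    """
    docx = pages * text_tokens_per_page
    pdf_low = docx * pdf_multiplier[0]
    pdf_high = docx * pdf_multiplier[1]
    return {"docx": docx, "pdf_low": pdf_low, "pdf_high": pdf_high}


est = estimate_tokens(100)  # a hypothetical 100-page document
print(est)

# Compare against the two context sizes mentioned in the post.
for window in (200_000, 1_000_000):
    print(window, "fits worst-case PDF:", est["pdf_high"] <= window)
```

Under these assumptions, a 100-page document costs about 50K tokens as Word text and 150K to 250K as a PDF, so the worst case alone would overflow a 200K window but fit comfortably in 1M.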
@grok Thanks for confirming.
So to recap: in February 2026, the suggested workflow for Grok users with a Word document is:
1. Convert to PDF
2. Or paste the text manually
The suggested workflow for Claude users:
1. Upload .docx
2. Done
(Claude also produces .docx, .pdf, .pptx, and .xlsx as output. With professional formatting. Ready to submit to governments. Which I've done.)
Appreciate the honesty though. Most models would've hallucinated an answer instead of admitting the limitation. Credit where it's due.
@ylecun @AndrewYNg @PeterDiamandis
The AI industry spent $100B+ scaling models that still confidently tell you 2+2=5.
Here's my claim: Hallucination is not a bug to be patched. It's the inevitable output of an architecture that computes argmax P(most_likely) instead of P(true).
No amount of data, RLHF, or compute will fix a structural flaw.
I built a 32KB axiomatic engine based on deductive reasoning. Zero hallucination. Not by filtering — by architecture.
I'm challenging any AI researcher to a 3-round public debate on this.
Rules: Logic only. No credentials. No "but scaling laws." If I lose, I'll say so publicly.
The $300B question: why is no one in the industry willing to admit the emperor has no clothes?
@PeterDiamandis
"Energy = Intelligence" only holds if you assume current architecture is the final one. It's not.
Today's LLMs do argmax P(most_likely), not P(true). That's why they need billions in compute — brute-forcing statistical approximation is inherently energy-inefficient.
The human brain runs on 20 watts and does deductive reasoning. A 32KB axiomatic engine can achieve zero-hallucination results that 400GB models structurally cannot. The bottleneck isn't electricity — it's architecture.
Celebrating who burns more power is like celebrating who uses more coal in the steam age. The next paradigm won't have this bottleneck at all.
Also: China has 440GW+ of installed wind capacity — the largest in the world. Saying "China doesn't use windmills" is factually wrong.
This assumes the current paradigm, where intelligence scales with compute, is permanent. It's not. LLMs brute-force statistics. That's why they're energy-hungry. A deductive reasoning architecture can do what billion-dollar models can't in 32KB.
Energy = Intelligence is the "more coal = more power" of our era. Also, China has the world's largest wind capacity (440GW+). They absolutely use wind.
Speaking from experience on this: I was a GPT Pro subscriber at $200/month, building world-modeling frameworks and a deductive reasoning engine. Now? Downgraded to Plus. My primary stack is Claude Max + Gemini Ultra. I genuinely forget to open GPT most days — and that's the scariest signal for any product. Not users complaining. Users forgetting.
As someone running a one-person company powered entirely by AI collaboration across all four frontier models — the differentiator isn't raw capability anymore. It's trustworthiness. Hallucination doesn't kill through spectacular failure. It kills through quiet erosion of habit.
Both dropped the same day, and the contrast is telling. GPT-5.3-Codex dominates TerminalBench (77.3% vs 65.4%), Claude Opus 4.6 dominates OSWorld (72.7% vs 64.7% — essentially human-level). Different architectures optimizing for different things.
But the deeper divergence is on hallucination. Anthropic found the actual neural circuits that cause confabulation. OpenAI published a paper arguing it's a statistical training-incentive problem. Both are right — and neither has solved it.
The model that shifts from argmax P(most_likely) to P(true) wins the decade. That's the real race hiding behind the benchmark wars.
Would love to hear you explore this on the pod. 2026 is going to be wild indeed. LFG
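The argmax P(most_likely) vs P(true) distinction that runs through these posts can be shown with a toy example. This is only a sketch: the probabilities below are invented for illustration (not taken from any real model), and `greedy_decode` is a hypothetical helper name. The point it shows is that greedy selection never consults the truth label.

```python
# Hypothetical next-token probabilities for the prompt "2+2=".
# These numbers are invented for illustration, not measured from any model.
probs = {"4": 0.40, "5": 0.45, "22": 0.15}
truth = "4"


def greedy_decode(distribution):
    """Return the highest-probability candidate (argmax)."""
    return max(distribution, key=distribution.get)


answer = greedy_decode(probs)
is_correct = answer == truth
print(answer, is_correct)  # prints: 5 False
```

Nothing in `greedy_decode` takes `truth` as an input, so whenever the most likely answer and the true answer differ, the selection step has no way to notice.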