This paper quietly explains why so many people feel like LLMs are “almost smart, but somehow wrong.”
The core claim in this paper is very uncomfortable: most failures are not about missing information. They are about misreading intent even when all the relevant context is present.
The authors show that LLMs are very good at mapping text to plausible responses, but surprisingly weak at inferring what the user is trying to achieve. Two prompts can contain nearly identical information, yet imply very different goals. Humans pick this up instantly. Models often do not.
The paper separates “context understanding” from “intent understanding.” Context is the literal content: entities, constraints, instructions. Intent is latent: priorities, tradeoffs, what matters most if things conflict. Current models optimize for surface-level alignment, not goal inference.
One experiment makes this painfully clear.
Users asked questions that could reasonably be interpreted as either exploratory or decision-oriented. The models answered confidently but chose the wrong mode at high rates, giving verbose explanations when users wanted a recommendation, or giving a decisive answer when users were clearly still exploring. The information was correct. The response was wrong.
Another failure mode is over-literal instruction following. When users implicitly expect the model to fill gaps or challenge assumptions, the model instead treats the prompt as a closed specification. The result looks obedient but misses the point. This is not hallucination. It is misaligned helpfulness.
The authors also test paraphrasing. When the same intent is expressed with different phrasing, model behavior shifts significantly. That tells us the model is anchoring on linguistic form, not reconstructing an underlying goal.
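A rough way to picture that paraphrase test, as a sketch only and not the paper's actual harness: send several phrasings of one intent through the model, label each response's mode, and measure how often the modes agree. The `paraphrase_consistency` helper, the `toy_mode` stand-in classifier, and the example prompts are all hypothetical, made up here for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def paraphrase_consistency(
    classify_mode: Callable[[str], str],
    paraphrases: Sequence[str],
) -> float:
    """Return the fraction of paraphrases that land in the most common response mode.

    1.0 means the behavior is stable across phrasings; lower values mean the
    behavior shifts with wording. `classify_mode` is an assumed callable that
    maps a prompt to a label such as "recommend" or "explain".
    """
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    modes = [classify_mode(p) for p in paraphrases]
    _, top_count = Counter(modes).most_common(1)[0]
    return top_count / len(modes)


def toy_mode(prompt: str) -> str:
    """Toy stand-in for a model that anchors on surface wording (hypothetical)."""
    return "recommend" if "should" in prompt.lower() else "explain"


# Three phrasings of the same underlying intent: "help me choose a database".
prompts = [
    "Which database should I pick for this app?",
    "What database fits this app best?",
    "Should I go with Postgres or SQLite here?",
]
score = paraphrase_consistency(toy_mode, prompts)
print(f"consistency: {score:.2f}")
```

The toy classifier keys on a single word, so one of the three phrasings flips its mode even though the goal is the same. That is the kind of form-anchoring described above, in miniature.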
"Humans normalize phrasing differences. Models react to them."
What’s striking is that longer context often worsens intent alignment. Adding more background increases the chance the model optimizes for local relevance instead of global purpose. More tokens give the illusion of understanding while diluting the signal of what the user actually wants.
The paper argues this is not solvable by bigger context windows or better prompting alone. Intent is not explicitly stated most of the time. It has to be inferred, tracked, and sometimes revised mid-conversation.
That requires models to reason about users, not just text.
The implication is brutal for agents and copilots. If a system cannot reliably infer intent, autonomy becomes dangerous. Tool use amplifies mistakes.
Confident execution based on a misunderstood goal is worse than asking a clarifying question.
The authors suggest future work should treat intent as a first-class object: something to model, update, and verify explicitly. Not just “what was said,” but “what outcome is being optimized.” Until then, many AI systems will continue to feel smart, fast, and subtly wrong.
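One way to read "intent as a first-class object" is an explicit state that gets updated as evidence arrives and is checked before the system acts. The sketch below is an illustration under my own assumptions, not the authors' design: `IntentState`, `next_action`, the additive evidence weights, and the 0.75 threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IntentState:
    """Explicit, revisable belief about what the user is trying to achieve."""

    # Maps a candidate goal to its accumulated (non-negative) evidence weight.
    hypotheses: dict[str, float] = field(default_factory=dict)

    def update(self, goal: str, evidence: float) -> None:
        """Add evidence for a goal, e.g. after a new user turn."""
        if evidence < 0:
            raise ValueError("evidence must be non-negative in this sketch")
        self.hypotheses[goal] = self.hypotheses.get(goal, 0.0) + evidence

    def best(self) -> tuple[str, float]:
        """Return the leading goal and its share of the total evidence."""
        total = sum(self.hypotheses.values())
        if total <= 0:
            raise ValueError("no evidence about intent yet")
        goal = max(self.hypotheses, key=lambda g: self.hypotheses[g])
        return goal, self.hypotheses[goal] / total


def next_action(state: IntentState, threshold: float = 0.75) -> str:
    """Verify intent before acting: proceed only if one goal clearly dominates."""
    goal, confidence = state.best()
    if confidence >= threshold:
        return f"act:{goal}"
    return f"ask:{goal}"  # ask a clarifying question instead of executing


state = IntentState()
state.update("explore options", 1.0)
state.update("get a recommendation", 1.0)
first = next_action(state)   # evidence is split, so the system asks

state.update("get a recommendation", 4.0)  # user says "just tell me which one"
second = next_action(state)  # one goal now dominates, so the system acts
print(first, "|", second)
```

The point of the gate in `next_action` is the tradeoff named earlier: when the goal is ambiguous, a clarifying question is cheaper than confident execution on the wrong goal.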
This paper explains why that feeling keeps coming up.
Paper: Beyond Context: Large Language Models Failure to Grasp Users Intent
@karpathy That’s awesome that it turned off and on the lights. I like that it doesn’t trust things. It’s funny that it doesn’t always take the shortest path. That’s where innovation can be found.
@zhang_matt @RuiHuang_art @Sothebys Congrats, it’s cool stuff. Sort of reminds me of the early days of computer graphics, with groups competing for the coolest use of the processor to do ray tracing and Phong shading models. Mixed with a Kraftwerk vibe.
@lyonwj I just started learning Neo4j today so I can use it for an application I am creating. This book / video series is perfect for learning quickly. Seriously, thank you for putting all of this together.
Get your free download of the new O'Reilly Graph Algorithms book here: https://t.co/IJodrAlXWg. Includes hands-on examples of how to use graph algorithms in Apache Spark and Neo4j. Dive into popular algorithms like PageRank, Label Propagation and Louvain Modularity!
@wintonARK playing with this idea, are you suggesting that neural nets could be thought of as a higher, possibly universal form of language? pkzip -add me_V2021.zip *.*
Finding another rabbit hole here, this is a really cool way to use sound to augment our perceptions of our world. I can think of loads of interesting use cases https://t.co/HiYWNnT4ZW
#MicrosoftResearch I would love to see your natural language understanding AI play this text-based AI that generates new adventures based on what the user does: https://t.co/fsz1GrE5Ws
Very interesting read. I wonder what countermeasures can be taken against this approach? Are my phone calls being monitored in real time and added to a graph of concepts? Spooky, but sort of cool. https://t.co/S77mghEVpi
@lexfridman @stephen_wolfram Anyone know what the deal is with the NASA stuff I keep seeing podcasters wearing? Am I missing a subtle signal here? Joe wore a NASA jumpsuit when he was talking to Duncan.