IMO — this paper misses the core driver of hallucinations
An LLM with a billion neurons is like a billion tiny databases: one database per neuron
When you prompt it, the LLM looks in all the databases (i.e. neurons) for patterns it recognizes
For example, when you prompt "Kim Kardashian is dating ..."
The LLM looks in its billions of little hash tables and pulls out patterns:
- vocabulary (words like Kim, instagram, etc.)
- grammar (subjects -> verb -> object)
- semantics (Kim Kardashian's known associates)
But here's the problem: when you prompt it for something unfamiliar, the LLM still recognizes some patterns (e.g. good grammar):
- vocabulary (words like Kim, instagram, etc.)
- grammar (subjects -> verb -> object)
But if it doesn't find all the right cache entries:
- semantics (Kim Kardashian's known associates)
- date ranges (maybe she dated different people at different times)
Then the LLM will make next-token predictions based on the hash-hits it found... but without the benefit of the entries it missed.
So to return to the prompt: "Kim Kardashian is dating ..."
- Grammar patterns: the next token will be a noun
- Semantic patterns: the next token will be a first name (because "is dating" is usually followed by a name)
- Gender pattern: the next token will be a male name
- Relationship patterns: the next token will be the name of a male Kim is associated with a lot
... but if it can't find the hash-hit in its internal neurons for the SPECIFIC male she's dating... it can hit on other things... like
- generic male names
- males who appear in articles with Kim
- other grammatically correct words like "no-one"
We call this a hallucination, but IMO it's closer to a cache miss.
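To make the analogy concrete, here's a toy sketch of the lookup-with-fallback behavior described above. It is NOT how an LLM actually works internally; every table, name, and function here is hypothetical and made up purely for illustration.

```python
# Toy illustration of the "cache miss" analogy. All tables and names
# below are hypothetical placeholders, not real data or a real model.

# Specific fact table: empty for this subject, i.e. the "cache miss".
CURRENT_PARTNER = {}

# Weaker patterns that still produce hits.
CO_MENTIONS = {"Kim Kardashian": ["Person A", "Person B"]}
GENERIC_MALE_NAMES = ["John", "Mike"]


def complete(subject):
    """Return (next_token, source) for the prompt '<subject> is dating ...'."""
    # Best case: the specific entry exists.
    if subject in CURRENT_PARTNER:
        return CURRENT_PARTNER[subject], "specific-hit"
    # Miss on the specific fact: fall back to people who co-occur with the subject.
    if subject in CO_MENTIONS:
        return CO_MENTIONS[subject][0], "co-mention-fallback"
    # Miss again: fall back to anything that fits the grammar.
    return GENERIC_MALE_NAMES[0], "generic-fallback"


print(complete("Kim Kardashian"))
print(complete("Someone Unknown"))
```

Note the answer looks equally confident in all three cases. Only the second value in the tuple (which the toy tracks, and which a real LLM does not expose this way) tells you a fallback was used.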
So how do you solve hallucination?
This paper from OpenAI suggests that we solve hallucination by putting "I don't know" in a bunch of the databases.
But this isn't how you solve for cache misses — this is just how you create more cache hits of a certain type.
If you had a database which was returning erroneous results, would you *fill* the database with "I don't know" entries???...
On the one hand, that WOULD increase the chances that the erroneous result was "I don't know"... so you'd make some partial progress at a surface level.
But IMO it's not solving the underlying problem... which is closer to detecting the sources/datapoints used for each prediction (MoE, RAG, etc. are making progress on this).
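Here's a toy sketch of the "fill the database with I don't know" thought experiment, assuming (hypothetically) that the answer is drawn in proportion to how many matching entries exist. The entries and the helper name are made up for illustration.

```python
from collections import Counter


def answer_distribution(entries):
    """Probability of each answer if one entry is picked uniformly at random."""
    counts = Counter(entries)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


# Hypothetical fallback entries that fire when the specific fact is missing.
fallback_entries = ["Person A", "Person B", "generic name"]

before = answer_distribution(fallback_entries)
after = answer_distribution(fallback_entries + ["I don't know"] * 3)

print(before)
print(after)
```

In this toy, "I don't know" becomes the most likely answer, but the erroneous entries are all still in there and still get picked some of the time. Nothing in it tells you WHICH entries a given answer came from.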
IMO - a more fundamental solution would involve solving attribution-based control (link below)
Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it 🧵
@alexalbert__ Would like to hear more about Anthropic's preparation for global elections -- specifically what harms are anticipated and what evaluations are most relevant
As millions of people across the country take to the streets and raise their voices in response to the killing of George Floyd and the ongoing problem of unequal justice, I’ve heard many ask how we can sustain momentum to bring about real change.
FROM 3-4 PM THE #GivingBlueday TWEET WITH THE MOST RETWEETS WINS $1,000!!! Help MRun out and give this a retweet so we can pay for #NircaNats and keep the club affordable for everyone!!
Michigan team specialist David Sun shares his experience at SF Blockchain Week, from Dharma’s view on debt to Weyl’s fireside chat! https://t.co/CCqmEM1ED2
After all my hard work of how I wanted to start my clothing line, after all the crap people gave me, I am proud to announce my clothing line @sher__rag will be dropping September 15. Thank you everyone for supporting through this long journey.