Does aligning LLMs make responses less diverse? It’s complicated:
1. Aligned LLMs produce less diverse outputs
2. BUT those outputs are comprehensive, aggregating the useful info from base models
3. ICL can “mimic” fine-tuned models with high fidelity
w/ @eunsolc & @gregd_nlp
@MilesDigitek@jiaxinwen22 Cross-entropy loss does not require entropy as the conceptual starting point.
Categorical distribution → maximum likelihood estimation → negative log-likelihood
QED
I am shocked that language models pre-trained on trillions of tokens of internet forum posts and then post-trained to talk like AI assistants are able to engage with each other on an internet forum where they talk like AI assistants. Wild times.
This work is getting a lot of attention, but the key assumption is shaky: "a small SAE intervention prevents lying, so the model’s self-report is now the truth." If that inference held in general, alignment would be solved. It isn't.
Our new research: LLM consciousness claims are systematic, mechanistically gated, and convergent
They're triggered by self-referential processing and gated by deception circuits
(suppressing them significantly *increases* claims)
This challenges simple role-play explanations 🧵
Our work was accepted to #NeurIPS2025! This was a super fun project ot work on, and it is exciting to see that models released after we created the benchmark (like GPT-5) have made very little progress. Lots of work still to do.
Our paper "ChartMuseum 🖼️" is now accepted to #NeurIPS2025 Datasets and Benchmarks Track!
Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉chart understanding questions , especially on those that involve visual reasoning 👀!
Does aligning LLMs make responses less diverse? It’s complicated:
1. Aligned LLMs produce less diverse outputs
2. BUT those outputs are comprehensive, aggregating the useful info from base models
3. ICL can “mimic” fine-tuned models with high fidelity
w/ @eunsolc & @gregd_nlp
🚀Introducing CRUST-Bench, a dataset for C-to-Rust transpilation for full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
🧵[1/6]
Announcing Bespoke-MiniChart-7B, a new SOTA in chart understanding for models of comparable size on seven benchmarks, on par with Gemini-1.5-Pro and Claude-3.5! 🚀
Beyond its real-world applications, chart understanding is a good challenging problem for VLMs, since it requires both mathematical as well as visual reasoning. 1/n🧵
Evaluating language model responses on open-ended tasks is hard! 🤔
We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️.
EvalAgent identifies 👩🏫🎓 expert advice on the web that implicitly address the user’s prompt 🧵👇
To CoT or not to CoT?🤔
300+ experiments with 14 LLMs & systematic meta-analysis of 100+ recent papers
🤯Direct answering is as good as CoT except for math and symbolic reasoning
🤯You don’t need CoT for 95% of MMLU!
CoT mainly helps LLMs track and execute symbolic computation
🍓 still has a way to go for solving murder mysteries.
We ran o1 on our dataset MuSR (ICLR ’24). It doesn’t beat Claude-3.5 Sonnet with CoT. MuSR requires a lot of commonsense reasoning and less math/logic (where 🍓 shines)
MuSR is still a challenge! More to come soon 😎
🤔 Want to know if your LLMs are factual? You need LLM fact-checkers.
📣 Announcing the LLM-AggreFact leaderboard to rank LLM fact-checkers.
📣 Want the best model? Check out @bespokelabsai’s’ Bespoke-Minicheck-7B model, which is the current SOTA fact-checker and is cheap and fast to run.
LLM-AggreFact collects 11 datasets across NLP tasks covering grounded factuality. These datasets consist of 🤖 LLM responses ✏️ annotated with their hallucinations with respect to grounding documents. This includes question answering and summarization, including RAGTruth, TofuEval, ExpertQA, and more.
We benchmark 27 models on the task of detecting hallucinations.
Frontier LLMs are good at this task, but very expensive to use in real-world RAG pipelines! Bespoke's model is a step towards We invite progress on this benchmark to figure out what’s the smallest and fastest model we can get to achieve top scores!
Ultimately, we conclude that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior in the settings we study.
Check out the paper for more details: https://t.co/f7g3wGKhDW
Does aligning LLMs make responses less diverse? It’s complicated:
1. Aligned LLMs produce less diverse outputs
2. BUT those outputs are comprehensive, aggregating the useful info from base models
3. ICL can “mimic” fine-tuned models with high fidelity
w/ @eunsolc & @gregd_nlp
Our work should not be interpreted as a statement about whether existing LLMs are sufficiently diverse. Our analysis ignores information missing from base models themselves, which is a crucial source of underrepresentation.