mostly awks, sometimes funny, advocate of animal rights, loves long walks. I steal birb photos from the internet and post them here cause it makes me happy
@ICICILombard hey, is it your policy to reject health insurance for young people with a medical history? I’m ready to pay an extra premium and wait longer for the lock period. Mine has been rejected, with no reason given. I’m 29 yo, with no chronic illness, and yet my claim is rejected
For women / under-represented minorities applying to AI PhD programs this year: if you’d like feedback on your application, feel free to consider me a resource, particularly if you’re a non-traditional applicant. DMs open!
(Paying it forward for the @WiMLworkshop community 💛)
One of the most important papers in AI: a tiny brain-inspired 27M param model trained on 1000 samples outperforms o3-mini-high on reasoning tasks!
Still can't believe this tiny lab of Tsinghua grads gets 40% on ARC-AGI, solves hard sudoku and mazes.
We're still so early.
Attention is all you need - but how does it work? In our new paper, we take a big step towards understanding it. We developed a way to integrate attention into our previous circuit-tracing framework (attribution graphs), and it's already turning up fascinating stuff! 🧵
Did you remember all the things to do before sharing a figure?
Remove "_" from the labels
save as pdf
...
A lot of deeper tips on writing are in the shared document, but somehow this practical list was always the most popular.
Good luck in #neurips
Big news: we've figured out how to make a *universal* reward function that lets you apply RL to any agent with:
- no labeled data
- no hand-crafted reward functions
- no human feedback!
A 🧵 on RULER
There's been a hole at the heart of #LLM evals, and we can now fix it.
📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.
❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer and get high accuracies. This affects popular benchmarks like MMLU-Pro, SuperGPQA etc. and even "multimodal" benchmarks like MMMU-Pro, which can be solved without even looking at the image ⁉️.
Such choice-only shortcuts are hard to fix. We find prior attempts at fixing them, GoldenSwag (for HellaSwag) and TruthfulQA v2, ended up worsening the problem. MCQs are inherently a discriminative task, only requiring picking the correct choice among a few given options. Instead we should evaluate language models for the generative capabilities they are used for. We show discrimination is easier than even verification, let alone generation.
🤔 But how do we grade generative responses outside "verifiable domains" like code and math? So many paraphrases are valid answers... We show a scalable alternative, Answer Matching, works surprisingly well. It's simple: without showing choices, get generative responses to existing benchmark questions that are specific enough to have a semantically unique answer. Then, use an LM to match the response against the ground-truth answer.
👨‍🔬 We conduct a meta-evaluation by comparing to ground-truth verification on MATH, and human grading on MMLU-Pro and GPQA-Diamond questions. Answer Matching outcomes give near-perfect alignment, even with small (recent) models like Qwen3-4B. In contrast, LLM-as-a-judge, even with frontier reasoning models like o4-mini, fares much worse. This is because without the reference answer, the model is tasked with verification, which is harder than what answer matching requires: paraphrase detection, a skill modern language models have aced 💡
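A minimal sketch of the answer-matching loop described above, not the paper's implementation. The names `build_matching_prompt`, `toy_matcher`, and `answer_matching_accuracy` are hypothetical, and the prompt wording is an assumption. The thread uses an LM as the matcher; here a normalized substring check stands in for it so the example runs without any model.

```python
import re
from typing import Callable, Iterable, Tuple

Item = Tuple[str, str, str]  # (question, reference answer, model response)


def build_matching_prompt(question: str, reference: str, response: str) -> str:
    """Prompt one might send to a matcher LM (wording is an assumption)."""
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {response}\n"
        "Does the response give the same answer as the reference? Reply yes or no."
    )


def _normalize(text: str) -> str:
    # Lowercase and drop punctuation so trivial surface differences do not matter.
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def toy_matcher(question: str, reference: str, response: str) -> bool:
    """Stand-in for the LM matcher: true if the reference appears in the response."""
    return _normalize(reference) in _normalize(response)


def answer_matching_accuracy(
    items: Iterable[Item],
    matcher: Callable[[str, str, str], bool] = toy_matcher,
) -> float:
    """Fraction of free-form responses the matcher accepts as matching the reference."""
    items = list(items)
    if not items:
        return 0.0
    hits = sum(1 for q, ref, resp in items if matcher(q, ref, resp))
    return hits / len(items)


if __name__ == "__main__":
    data = [
        ("What is the capital of France?", "Paris", "It is Paris."),
        ("What is 2 + 2?", "4", "five"),
    ]
    print(answer_matching_accuracy(data))
```

To use a real model, pass a `matcher` that sends `build_matching_prompt(...)` to it and parses the yes/no reply. The toy matcher would miss valid paraphrases, which is the gap the LM matcher is meant to close.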
Let's shift the benchmarking ecosystem from MCQs to Answer Matching. Impacts:
Leaderboards: We show model rankings can change and accuracies go down, making benchmarks seem less saturated.
Benchmark Creation: Instead of creating harder MCQs, we should focus our efforts on creating questions for answer matching, much like SimpleQA, GAIA etc.
🤑 Cost: Finally, to our great surprise, answer matching evals are cheaper to run than MCQs!
See our paper for more, it's packed with insights. 🧵 has paper and more result figures.
I was just thinking about FAIR’s research papers the other day. They really are all that thorough. Two papers that come to mind: the COCONUT paper and an old paper on implementing LRP for transformers
Facebook AI Research (FAIR) is a small, prestigious lab in Meta. We don't train large models like GenAI or MSL, so it's natural that we have limited GPUs. GenAI or MSL's success or failure, past or future, doesn't reflect the work of FAIR. It is important to make this distinction
🚨 New paper alert!
The linear representation hypothesis (LRH) argues concepts are encoded as a **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔
1/11
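A toy illustration of what "sparse sum of orthogonal directions" means, under stated assumptions: the four concept names and their 4-dimensional directions are invented for the example and do not come from the paper. With orthonormal directions, each active concept's coefficient can be read back with a single dot product.

```python
# Four hypothetical concept directions in a 4-d space (orthonormal by construction).
DIRECTIONS = {
    "concept_a": [0.5, 0.5, 0.5, 0.5],
    "concept_b": [0.5, -0.5, 0.5, -0.5],
    "concept_c": [0.5, 0.5, -0.5, -0.5],
    "concept_d": [0.5, -0.5, -0.5, 0.5],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def encode(coeffs):
    """Build a representation as a sum of a few scaled concept directions."""
    x = [0.0, 0.0, 0.0, 0.0]
    for name, c in coeffs.items():
        for i, d in enumerate(DIRECTIONS[name]):
            x[i] += c * d
    return x


def decode(x, tol=1e-9):
    """Read coefficients back by projecting onto each direction; keep nonzero ones."""
    out = {}
    for name, d in DIRECTIONS.items():
        c = dot(x, d)
        if abs(c) > tol:
            out[name] = c
    return out


if __name__ == "__main__":
    x = encode({"concept_a": 2.0, "concept_d": -1.0})
    print(decode(x))
```

Concepts that are not linear directions of this kind would not be recovered by projection, which is the sort of case the thread asks about.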
After more than half a year of work, it's finally done! In my new paper I demonstrate a new technique for mesoscopic understanding of language model behavior over time. We show that LM hidden states can be approximated by the same mathematics that governs the statistical properties of microscopic particles. And, more importantly, that this approximation is sufficient to very cheaply predict LLM misalignment and failure modes before they occur during inference.
Check it out below!
🧵 4/8 Result #1: Three distinct performance regimes 📈📉
Comparing thinking vs non-thinking models under the same inference token compute revealed:
🟡 LOW complexity: Standard LLMs actually outperform reasoning models (and are more efficient!)
🔵 MEDIUM complexity: Reasoning models gain advantage.
🔴 HIGH complexity: Both models completely collapse to 0% accuracy.
@bookingcom They are not helpful, I will be filing an e-dakhil form tomorrow. Is there any support you can offer? The experience has been really appalling, and the manager is refusing any compensation and not even solving the problem.
@bookingcom I booked a hotel through your portal in Darjeeling called Orsino Spa resort. There was no hot water for 2 days and the manager provided us with no support in this regard. We have more detailed complaints re the hotel, how can we solve this?
@pinetreeresorts Horrible experience:
1. No hot water for 2 days. Manager’s solution? Buckets of water: 1 hot, 2 cold. Is this how you expect guests to maintain basic hygiene?
2. WiFi is terrible and practically unusable.