The takeaway: You can "Train It and Forget It."
The privacy & simplicity benefits of dropping the BPE merge list at inference may outweigh the minimal performance trade-offs, enabling more secure tokenization for deployed LLMs.
Joint work with @kartik_goyal_ (4/4)
BPE merge lists in LLMs are a privacy risk. What if we just ignored them at inference?
Our paper shows you can ditch the merge list without retraining. Merge-list-free tokenization has minimal impact on performance & can even improve it on some tasks.
Paper: https://t.co/bQ8fn860H3
👇 (1/4)
The results? Deliberately corrupting the merge list tanks performance.
But our compression-based methods are robust, even *outperforming* the standard tokenizer on QA (MMLU/ARC) & open-ended generation. We saw only modest drops in machine translation. (3/4)
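A rough sketch of what tokenizing without a merge list can look like. The thread doesn't spell out the exact algorithms, so this is an assumption: a greedy longest-match pass that uses only the vocabulary, which tends to keep token counts low. The function name, the toy vocab, and the `<unk>` fallback are all hypothetical.

```python
from typing import List, Set


def greedy_longest_match(text: str, vocab: Set[str], unk: str = "<unk>") -> List[str]:
    """Tokenize using only the vocabulary (no merge list).

    At each position, take the longest vocabulary entry that matches.
    This is an illustrative assumption, not the paper's confirmed method.
    """
    max_len = max(map(len, vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry covers this character.
            tokens.append(unk)
            i += 1
    return tokens


toy_vocab = {"l", "o", "w", "e", "r", "lo", "low", "er"}
print(greedy_longest_match("lower", toy_vocab))
```

Nothing here needs the order in which merges were learned, only the final vocabulary.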
LLMs don’t take tests like students.
So why evaluate them like students?
Our method decouples reasoning from answer selection.
It’s automatic, scalable, and works with existing QA benchmarks.
📄 https://t.co/3dNk9foHL4
w/ Ryan Yan and @kartik_goyal_
Why do we evaluate LLMs using multiple-choice QA...
...when in practice, we ask them to generate open-ended answers?
Standard evaluation rewards models for choosing the right letter — not for reasoning their way to an answer.
A better alternative: Cascaded Information Disclosure
We tried using another LLM to “judge” the model’s reasoning.
Turns out it’s unreliable — even when we feed it perfect explanations (!)
But when we match explanations to answers, accuracy shoots up (>99%).
No hallucinated grading.
What did I tell you a few days ago? 2024 is the year of robotics. Mobile-ALOHA is open-source robot hardware that can do dexterous, bimanual tasks like cooking a meal (with human teleoperation). Very soon, hardware will no longer bottleneck us on the quest for human-level, generally capable robots. The brain will be.
This work was done by 3 researchers on an academic budget. What an incredible job! Stanford rocks! Congrats to @zipengfu @tonyzzhao @chelseabfinn
Academia is no longer the place for the biggest frontier LLMs, simply because of resource constraints. But robotics levels the playing field a bit between academia and industry, at least in the near term. More affordable hardware is the inevitable trend. Advice for aspiring PhD students: embrace robotics - less crowded, more impactful.
Website: https://t.co/gFcgiuTxrg
Hardware assembly tutorial (oh yes we need more of these!): https://t.co/FK5Twrgniz
Codebase: https://t.co/8FsfEXGfVg
This is an interesting paper that learns a process reward model without human annotations.
The idea is to evaluate the accuracy of full reasoning traces generated from a given partial reasoning step.
Nice to see Llemma-34B getting 47.3% on MATH!
https://t.co/FtApBbuGha
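A minimal sketch of that idea, assuming you have some way to sample a completion from a partial solution and a reference answer to check against. The names `estimate_step_value`, `label_steps`, and the toy `complete` function are hypothetical, not from the paper.

```python
from typing import Callable, List

# Hypothetical signature: (question, partial_steps) -> final answer string.
Completer = Callable[[str, List[str]], str]


def estimate_step_value(
    question: str,
    partial_steps: List[str],
    complete: Completer,
    reference_answer: str,
    num_rollouts: int = 8,
) -> float:
    """Fraction of sampled completions from this prefix that reach the reference answer."""
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    hits = 0
    for _ in range(num_rollouts):
        final_answer = complete(question, partial_steps)
        if final_answer.strip() == reference_answer.strip():
            hits += 1
    return hits / num_rollouts


def label_steps(
    question: str,
    steps: List[str],
    complete: Completer,
    reference_answer: str,
    num_rollouts: int = 8,
) -> List[float]:
    """Score every prefix of a solution; the scores can serve as step-level labels."""
    return [
        estimate_step_value(question, steps[: i + 1], complete, reference_answer, num_rollouts)
        for i in range(len(steps))
    ]


def toy_complete(question: str, partial_steps: List[str]) -> str:
    # Stand-in for sampling from a model: a flawed step derails the answer.
    return "5" if any("mistake" in s for s in partial_steps) else "4"


print(label_steps("2+2", ["add the numbers", "mistake: carry 1"], toy_complete, "4", 4))
```

With a real model, `complete` would be stochastic, so each score is a Monte Carlo estimate of how often that prefix leads to a correct answer.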
Intuitively, superhuman AI systems should "know" if they're acting safely.
But can we "summon" such concepts from strong models with only weak supervision?
Incredibly excited to finally share what we've been working on: weak-to-strong generalization. 1/
https://t.co/FiFGhrqqE0
The first thing you need to build a high quality mathematics model is high quality mathematics data. Don't worry, we got your back!
Hear the oral at the Math-AI Workshop!
https://t.co/Zh8QAARF67