Ever wonder if LLMs use tools🛠️ the way we ask them?
We explore LLMs using classical planners: are they writing *correct* PDDL (planning) problems?
Say hi👋 to Planetarium🪐, a benchmark of 132k natural language & PDDL problems.
📜 Preprint: https://t.co/kXItV6j6Dg
🧵1/n
How well do language models generalize to problems that are harder, or even easier, than the ones they were trained on?
We show that LLMs don’t generalize across difficulty levels quite as much as you might think. 🧵
We're happy to announce that effective as of July 1, 2025, faculty members @stevebach and @drsrinathsridha have received named chairs. Steve is now the Eliot Horowitz Assistant Professor in CS and Srinath is the John E. Savage Assistant Professor in CS: https://t.co/HF8UPA3VYf
I will be at #ICLR2025 in a few days to present this work with @surajk610! Feel free to DM me if you want to chat about mechinterp, cognitive science, or anything else!
I started a blog! First post is everything I know about setting up (fast, reproducible, error-proof) Python project environments using the latest tools. These methods have saved me a lot of grief. Also a short guide to CUDA in the appendix :)
https://t.co/AIiYgyZB7C
If we guide the activation in the ‘right’ part of the subspace, we can improve performance pretty dramatically, although we don’t completely fix the problem.
🤔How do multilingual LLMs encode structural similarities across languages?
🌟We find that LLMs use identical circuits when languages share the same morphosyntactic processes. However, they involve specialized components to handle tasks when a language contains specific linguistic features⤵️
@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach You can find the domain files in our repo.
To give you an idea of the mistake they’re making: in the whole file they generate, they add ~6 extra characters that shouldn’t be there which break equivalence.
Is o1 really better at reasoning? Or is it just reinforcing what it already knows?
We put o1-preview to the test on (some of) our planning problem generation dataset, planetarium🪐. Here’s what we found:
💻repo: https://t.co/eBnNyAyRYM
🧵Thread 👇
@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach It’s not a quirky environment at all. I’m actually simplifying the domain for them. But many PDDL environments have typing; this one didn’t. It just keeps assuming typing is a thing.
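To make the typing mistake concrete, here is a minimal sketch of how one might flag it. The PDDL snippets and helper names are hypothetical and made up for illustration; they are not taken from the Planetarium repo, and the check is deliberately simplified (for example, it ignores that `:adl` also implies typing).

```python
import re

# Hypothetical, simplified PDDL snippets for illustration only.
UNTYPED_DOMAIN = (
    "(define (domain blocksworld) (:requirements :strips) "
    "(:predicates (on ?x ?y) (clear ?x)))"
)
TYPED_PROBLEM = (
    "(define (problem p1) (:domain blocksworld) "
    "(:objects b1 b2 - block) (:init (on b1 b2) (clear b1)) (:goal (on b2 b1)))"
)
UNTYPED_PROBLEM = (
    "(define (problem p1) (:domain blocksworld) "
    "(:objects b1 b2) (:init (on b1 b2) (clear b1)) (:goal (on b2 b1)))"
)


def domain_declares_typing(domain: str) -> bool:
    """Return True if the domain's :requirements list includes :typing."""
    match = re.search(r"\(:requirements([^)]*)\)", domain)
    return bool(match) and ":typing" in match.group(1).split()


def problem_uses_types(problem: str) -> bool:
    """Return True if the :objects section contains a standalone '-' type separator."""
    match = re.search(r"\(:objects([^)]*)\)", problem)
    return bool(match) and "-" in match.group(1).split()


def assumes_missing_typing(domain: str, problem: str) -> bool:
    """Flag a problem that types its objects although the domain never enables typing."""
    return problem_uses_types(problem) and not domain_declares_typing(domain)
```

A handful of characters such as `- block` is enough to trip this check, even when the `:init` and `:goal` sections are otherwise fine.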
o1 and 4o are ignoring their context and solving the problem not with reasoning but with memory
@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach They’re definitely big percentage changes. But my takeaway here was that o1’s behaviors & types of mistakes match 4o’s, in ways I wouldn’t expect from something trained on reasoning.
@AlexandreTL2 @FPiedrahitaV @jacobli99 @mlittmancs @stevebach Take a look at the second half of the thread. 4o and o1 technically get a lot more of what we would call the harder “logic” right, but almost all their mistakes are due to ignoring the domain. o1 just ignores the domain less.
o1 exemplifies more than ever a need to evaluate emergent reasoning capabilities separately from pretraining regurgitation, which is why we’re continuing to improve Planetarium🪐!
💻 Keep up to date through our repo: https://t.co/eBnNyAyRYM
🧵5/5
These characteristics (ignoring domain, better performance on common domains) seem more likely to be the fruit of an incremental improvement to training than a jump in reasoning, at least on our problems.
🧵4/5
@chris_j_paxton @lhmccabe I think this view also helps explain why certain tasks seem impossible to learn through ICL (negated sentiment analysis for example)