Max Zuo @max_zuo - Twitter Profile

Pinned Tweet

almost 2 years ago

Ever wonder if LLMs use tools🛠️ the way we ask them? We explore LLMs using classical planners: are they writing *correct* PDDL (planning) problems? Say hi👋 to Planetarium🪐, a benchmark of 132k natural language & PDDL problems. 📜 Preprint: https://t.co/kXItV6j6Dg 🧵1/n

Max Zuo @max_zuo

6 months ago

LLMs don’t learn or think like us. So why do we still evaluate their generalization in the same way? A better look at generalization in LLMs:

Yeganeh Kordi @yeganekordi

6 months ago

How well do language models generalize to problems that are harder, or even easier, than the ones they’ve trained on? We show that LLMs don’t generalize across difficulty levels quite as much as you might think. 🧵

max_zuo retweeted

Brown CS @BrownCSDept

about 1 year ago

We're happy to announce that effective as of July 1, 2025, faculty members @stevebach and @drsrinathsridha have received named chairs. Steve is now the Eliot Horowitz Assistant Professor in CS and Srinath is the John E. Savage Assistant Professor in CS: https://t.co/HF8UPA3VYf

max_zuo retweeted

Michael Lepori @Michael_Lepori

about 1 year ago

I will be at #ICLR2025 in a few days to present this work with @surajk610! Feel free to DM me if you want to chat about mechinterp, cognitive science, or anything else!

max_zuo retweeted

Apoorv Khandelwal @apoorvkh

over 1 year ago

I started a blog! First post is everything I know about setting up (fast, reproducible, error-proof) Python project environments using the latest tools. These methods have saved me a lot of grief. Also a short guide to CUDA in appendix :) https://t.co/AIiYgyZB7C

max_zuo retweeted

Jack Merullo @jack_merullo_

over 1 year ago

If we guide the activation in the ‘right’ part of the subspace, we can improve performance pretty dramatically, although we don’t completely fix the problem.

max_zuo retweeted

Ruochen Zhang @ruochenz_

over 1 year ago

🤔 How do multilingual LLMs encode structural similarities across languages? 🌟 We find that LLMs use identical circuits when languages share the same morphosyntactic processes. However, they involve specialized components to handle tasks if they contain specific linguistic features ⤵️

Max Zuo @max_zuo

over 1 year ago

@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach You can find the domain files in our repo. To give you an idea of the mistake they’re making: across the whole file they generate, they add ~6 extra characters that shouldn’t be there, and those break equivalence.

Max Zuo @max_zuo

over 1 year ago

Is o1 really better at reasoning? Or is it just reinforcing what it already knows? We put o1-preview to the test on (some of) our planning problem generation dataset, planetarium🪐. Here’s what we found: 💻repo: https://t.co/eBnNyAyRYM 🧵Thread 👇

Max Zuo @max_zuo

over 1 year ago

@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach It actually happens for blocksworld too, but blocksworld is more common and appears with typing less frequently in PDDL examples, in my experience.

Max Zuo @max_zuo

over 1 year ago

@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach It’s not a quirky environment at all. I’m actually simplifying the domain for them. Many PDDL environments have typing, but this one doesn’t, and the model just keeps assuming typing is a thing. o1 and 4o are ignoring their context and solving the problem not with reasoning but with memory.

Max Zuo @max_zuo

over 1 year ago

@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach If anything I think it shows we still have space for making benchmarks that can explicitly differentiate reasoning from just having better/more pretraining.

Max Zuo @max_zuo

over 1 year ago

@paul_cal @FPiedrahitaV @jacobli99 @mlittmancs @stevebach They’re definitely big percentage changes. But my takeaway here was that o1’s behaviors & types of mistakes match 4o’s, in ways I wouldn’t expect from something trained on reasoning.

Max Zuo @max_zuo

over 1 year ago

@AlexandreTL2 @FPiedrahitaV @jacobli99 @mlittmancs @stevebach Take a look at the second half of the thread. 4o and o1 technically get a lot more of what we would call the harder “logic” right, but almost all their mistakes are due to ignoring the domain. o1 just ignores the domain less.

Max Zuo @max_zuo

over 1 year ago

o1 exemplifies more than ever a need to evaluate emergent reasoning capabilities separately from pretraining regurgitation, which is why we’re continuing to improve Planetarium🪐! 💻 Keep up to date through our repo: https://t.co/eBnNyAyRYM 🧵5/5

Max Zuo @max_zuo

over 1 year ago

These characteristics (ignoring domain, better performance on common domains) seem more likely to be the fruit of an incremental improvement to training than a jump in reasoning, at least on our problems. 🧵4/5

Max Zuo @max_zuo

almost 2 years ago

@chris_j_paxton @lhmccabe I think this view also helps explain why certain tasks seem impossible to learn through ICL (negated sentiment analysis for example)

max_zuo retweeted

Yann LeCun @ylecun

almost 2 years ago

Not only can't LLMs plan, they can't even generate specifications of a problem (in PDDL) that a standard planner could solve.
