Michaël Defferrard

@m_deff

Scientist. ML and (computational) graphs at @Qualcomm AI Research. Previously @EPFL_en (PhD with @trekkinglemon), @BerkeleyLab.

🇨🇭 Switzerland

Joined May 2015

920 Following

1.5K Followers

1.5K Posts

m_deff retweeted

Reza Ebrahimi

@rzebrahimi

3 months ago

Transformers are data‑hungry in sequential tasks because they lack the right inductive bias.

It’s well known that for many sequential problems (from adding numbers to step‑by‑step agentic execution and multi‑hop reasoning), transformers fail to generalize to longer sequences than they were trained on. “Train short, test long” often fails.

The usual workaround is to "just train on whatever length you’ll need at test time".

---------
📉 But we show the consequence of this is data inefficiency:

• Transformers can learn tasks for a single fixed sequence length fairly efficiently, but learning across multiple lengths requires much more data.

• More importantly, transformers tend not to share mechanisms across tasks of different lengths; instead, they often learn isolated, length‑specific solutions.

---------
🧪 A simple way to test this:
Consider modular addition (with and without CoT). Train a model to add 2, 3, …, L numbers at once and measure the data needed. Then train separate models for each length (2, 3, …, L) and sum their data requirements.
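A minimal sketch of how such modular-addition data could be generated, for readers who want to try the experiment. The function names, the modulus, and the chain-of-thought format (running partial sums) are illustrative assumptions and are not taken from the thread.

```python
import random

def make_example(n_terms, modulus, rng):
    """One sample: n_terms random residues and their sum modulo `modulus`."""
    terms = [rng.randrange(modulus) for _ in range(n_terms)]
    return terms, sum(terms) % modulus

def make_example_with_cot(n_terms, modulus, rng):
    """Same task, plus running partial sums as an assumed CoT format."""
    terms, target = make_example(n_terms, modulus, rng)
    partial, running = [], 0
    for t in terms:
        running = (running + t) % modulus
        partial.append(running)
    return terms, partial, target

def sample_joint(n_samples, max_len, modulus, seed=0):
    """Joint setting: each sample has a length drawn uniformly from 2..max_len."""
    rng = random.Random(seed)
    return [make_example(rng.randint(2, max_len), modulus, rng)
            for _ in range(n_samples)]

def sample_fixed(n_samples, n_terms, modulus, seed=0):
    """Separate setting: every sample has the same length n_terms."""
    rng = random.Random(seed)
    return [make_example(n_terms, modulus, rng) for _ in range(n_samples)]
```

The uniform distribution over lengths in `sample_joint` is only one possible choice; the thread does not say how lengths were sampled.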

💡The intuition:
If a model truly shares mechanisms across lengths, learning a distribution of lengths should require far fewer samples than learning each length separately.

This comes from amortizing the learning cost: data for length n also helps the model learn length n+k.

---------
📊 Results:

Sharing Factor κ = (sum of samples to learn each length separately) ÷ (samples to learn all lengths jointly)

- κ > 1: mechanism sharing and amortized learning.
- κ ≈ 1: learning length-specific solutions in isolation.
- κ < 1: destructive interference; length-specific solutions compete for model capacity.

Transformers showed low sharing factors, and even destructive interference with CoT.
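The definition of κ above can be written as a short helper. This is only a sketch: the sample counts are made-up placeholders rather than results from the thread, and the tolerance used to decide whether κ is "approximately 1" is an assumption.

```python
def sharing_factor(samples_per_length, samples_joint):
    """kappa = (sum of samples needed per length) / (samples needed jointly)."""
    if samples_joint <= 0:
        raise ValueError("samples_joint must be positive")
    return sum(samples_per_length.values()) / samples_joint

def interpret(kappa, tol=0.1):
    """Map kappa to the three regimes; `tol` is an assumed threshold."""
    if kappa > 1 + tol:
        return "sharing"        # mechanisms are shared, cost is amortized
    if kappa < 1 - tol:
        return "interference"   # length-specific solutions compete
    return "isolated"           # length-specific solutions, no sharing

# Hypothetical sample counts for lengths 2, 3 and 4 (not measured data).
separate = {2: 1000, 3: 1500, 4: 2500}
kappa = sharing_factor(separate, samples_joint=2500)  # 5000 / 2500 = 2.0
```

With these placeholder numbers the joint model would need half the data of the separate models combined, which is the amortization the intuition describes.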

---------
✨ Implications:
This suggests that end-to-end learning in applied agentic settings, like robotics or GUI control, could be even more challenging.

If data requirements grow unfavorably with sequence length, that might also help explain the persistent issues we see at large context lengths (e.g., context rot).

The standard attention mechanism appears inefficient for step-by-step tasks, and we may ultimately be better off with recurrent agents.


m_deff retweeted

Grigory Sapunov

@che_shr_cat

3 months ago

1/ We know Transformers fail at length extrapolation. But new research shows a deeper flaw: they fail at IN-DISTRIBUTION state tracking. They don't learn algorithmic rules, they just memorize isolated circuits per length. 🧵 https://t.co/CGVFMrqkPU


m_deff retweeted

Shubhendu Trivedi

@_onionesque

9 months ago

Looking at the thread. The common frame to look at the more general phenomenon involves an eigenproblem of the form Oƒ = λƒ, where the operator O encodes either: a symmetry (translations, rotations, general group transformations), or a statistic (e.g. covariance, correlation),
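As a concrete instance of the symmetry case, here is a small sketch in which O is a cyclic translation on n points. It checks numerically that a discrete Fourier mode satisfies Oƒ = λƒ. The choice of operator, the grid size, and the function names are illustrative assumptions, not something stated in the thread.

```python
import cmath

def shift(f):
    """Cyclic translation by one step: (O f)[i] = f[(i + 1) % n]."""
    return f[1:] + f[:1]

def fourier_mode(n, k):
    """Discrete Fourier mode f[i] = exp(2*pi*1j * k * i / n)."""
    return [cmath.exp(2j * cmath.pi * k * i / n) for i in range(n)]

n, k = 8, 3
f = fourier_mode(n, k)
lam = cmath.exp(2j * cmath.pi * k / n)  # eigenvalue of the shift for mode k

# Largest entrywise gap between O f and lam * f; zero up to rounding error.
residual = max(abs(a - lam * b) for a, b in zip(shift(f), f))
```

The same check with an arbitrary vector in place of `f` would generally give a large residual, since only the Fourier modes (and combinations sharing one eigenvalue) are left unchanged in direction by the shift.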

