Anand Gopalakrishnan @agopal42 - Twitter Profile

Pinned Tweet

5 months ago

Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information. Our fix (PoPE) is simple but powerful. Paper: https://t.co/XlltfcSwHQ

30

1K

177

1K

156K

Anand Gopalakrishnan @agopal42

1 day ago

@liu_yuejiang @NUSComputing Congrats! 🎉

1

0

145

Anand Gopalakrishnan @agopal42

6 days ago

@xuanalogue Yeah, I agree. I guess we'll all have to really internalise the idea of doing something for its own sake, which is a good outcome.

0

1

0

27

Anand Gopalakrishnan @agopal42

6 days ago

@xuanalogue So the only way for society to deal with the effects of large scale automation in the long run is to strive to reach enlightenment. So win-win overall? :D

1

0

58

Anand Gopalakrishnan @agopal42

6 days ago

@LouisKirschAI @SchmidhuberAI @inherent_labs Congratulations!! 🎉

0

161

agopal42 retweeted

Luca Ambrogioni

@LucaAmb

23 days ago

1/?) As promised to Sander Dieleman (@sedielem), we’re finally excited to share:

Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion

We show that continuous diffusion can achieve very strong language modeling performance when operating directly on bitstreams, outperforming masked and uniform diffusion baselines, and essentially matching autoregressive models under our evaluation settings.

6

225

37

161

27K

agopal42 retweeted

Michael C. Mozer @mc_mozer

about 1 month ago

[1/5] Intelligent behavior relies on maintaining an evolving, dynamic model of the environment. But even frontier models sometimes fail at tracking state in dialogs, e.g., actual transcript from 4/20/2026: https://t.co/84McZVFqBl

3

53

12

24

9K

Anand Gopalakrishnan @agopal42

3 months ago

@mli0603 Nice blogpost and explanations! You might find our recent work helpful in understanding another failure mode of RoPE -- entanglement of 'what' and 'where'. https://t.co/STVBkKY58J

0

6

0

5

2K

agopal42 retweeted

Kazuki Irie @kzkirie

3 months ago

Back in 2019, I reduced transformer LM KV-cache size by: (1) setting K=V (storing only K), (2) deeper FF blocks & fewer self-attn layers overall.

Published at ICASSP 2020. To my knowledge, the first publication on KV-cache reduction--lmk if you know anything older! https://t.co/CtoSTt1FHK

8

86

9

42

5K

agopal42 retweeted

Jonas

@LoosJonas

4 months ago

Can we replace RoPE with PoPE (Polar Coordinate Positional Embeddings) in pretrained language models?

Turns out we can! Using small pythia models, after a small recalibration (~2% of pretrain), we get significantly better length generalization.

1/3

https://t.co/Jf397xTsU9 https://t.co/6MgSxQIXRr

1

14

4

2

1K

agopal42 retweeted

Hansen Lillemark @hansenlillemark

5 months ago

State of the art World Models still lack a unified world memory for representing and predicting dynamics out of their field of view. Why is that, and how can we fix it? Introducing Flow Equivariant World Models: models with memory capable of predicting out of view dynamics!🧵⬇️

17

773

104

532

114K

agopal42 retweeted

Andy Keller @t_andy_keller

5 months ago

When you're crossing the street and turn your head, you typically remember whether or not a car is coming from the other direction - so why can't today's world models? Introducing Flow Equivariant World Models https://t.co/xUmWLjj0cW Led by @hansenlillemark & @huskydogewoof🧵👇

1

50

10

7

4K

Anand Gopalakrishnan @agopal42

5 months ago

@sasuke___420 @jm_alexia No we don't. By partial RoPE you mean applying rotations on a subset of all channels/features? Since that's an orthogonal design choice and can be done on both RoPE and PoPE we decided to compare the simplest versions.

0

1

0

106

agopal42 retweeted

Kazuki Irie @kzkirie

6 months ago

Humans can't write programs that classify cats vs dogs. Deep learning lets GD write that program. Continual learning is the same: it's too hard for us to design good CL algorithms. Let GD write that algorithm too. That’s the idea of metalearning CL algos: https://t.co/FjmEgwDlaO

0

13

2

9

860

Anand Gopalakrishnan @agopal42

5 months ago

@francoisfleuret To _err is human, to _errr is undefined

0

3

0

196

Anand Gopalakrishnan @agopal42

5 months ago

@deaton_jon Yes that's true, but the eqns (7-10) were presented in a feature/channel-wise manner. So we wrote an extra multiplication (per channel). Thanks for your response!

0

1

0

216

Anand Gopalakrishnan @agopal42

5 months ago

@eshear By log-polar do you mean log spaced frequencies (thetas) used in RoPE? Or something else?

0

1

0

59

Anand Gopalakrishnan @agopal42

5 months ago

@imayank42 Plan to release the repo soon-ish. Thanks for your interest!

1

5

0

718

Anand Gopalakrishnan @agopal42

5 months ago

Posted this a day early and the pun practically writes itself. Noooooo!

Anand Gopalakrishnan @agopal42

5 months ago

Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information. Our fix (PoPE) is simple but powerful. Paper: https://t.co/XlltfcSwHQ

30

1K

177

1K

156K

1

11

2

0

2K

Anand Gopalakrishnan @agopal42

5 months ago

11/ Key takeaway: The what-where entanglement in RoPE hurts sequence modelling performance and length generalization. PoPE's disentanglement provides a powerful inductive bias that solves these issues. Huge thanks to my co-authors @robert_csordas, @SchmidhuberAI, @mc_mozer!

1

28

0

1

3K
