Ke Li 🍁

Verified account

@KL_Div

Assistant Professor of Computing Science @SFU. Ph.D. from @Berkeley_EECS and Bachelor's from @UofTCompSci. Formerly @GoogleAI and Member of @the_IAS.

Vancouver, Canada

Joined June 2019

418 Following

6.5K Followers

225 Posts

Pinned Tweet

2 days ago

Diffusion and flow matching-based robot planners are slow and generate noisy and jerky trajectories. Delighted to share our ICRA 2026 paper, which leverages IMLE to improve planning frequency 19-fold from 4.3 Hz to 83 Hz and reduces jerk by 38% relative to flow matching. Joint work w/ Grayson Lee, Minh Bui, Shuzi Zhou, Yankai Li and Mo Chen. (1/7)

14

787

89

482

93K

about 22 hours ago

@jon_barron @BenMildenhall @_pratul_ @martin_casado Congratulations, @_pratul_ and @BenMildenhall!

0

2

0

0

423

2 days ago

For more details, see:
- Project website: https://t.co/KPjQltVS13
- Paper: https://t.co/eQ0ykIqMSR
- Code: https://t.co/TZS21GAMNg

If you are at ICRA, come check out Poster 301 on Tuesday afternoon! (7/7)

2

24

4

16

1K

2 days ago

We deploy our planner onboard a mobile robot and demonstrate real-time navigation in dynamic multi-agent environments. (6/7)

1

8

2

3

1K

about 1 month ago

A very nicely written blog post on IceCache! Highly recommended for anyone interested in how IceCache works under the hood.

about 1 month ago

I've fully covered the mathematical foundation of IceCache that was discussed in the paper, and parts that weren't detailed there. IceCache is a novel approach to managing KV caches that uses Dynamic Continuous Indexing (DCI) to organize and retrieve tokens based on their semantic relationships more efficiently. I walked through the complete sparse-retrieval theory step by step: every formula explained from first principles, every design choice motivated, every minute mathematical detail laid out. Implementation is in the next post... check it out https://t.co/pdzo5YX0Ka Thank you for this wonderful paper, would love any feedback or guidance @KL_Div @Mao_Yuzhen @q1tong

0

7

2

7

4K

0

6

0

14

3K

about 1 month ago

@xiaolonw @sainingxie @Meta Congratulations!

0

0

0

0

32

about 1 month ago

@pulkitology Congratulations, Pulkit! Very impressive video!

1

2

0

0

271

about 1 month ago

@tatavishnurao Yes, it's here: https://t.co/b5e3vGuZLt

0

8

2

5

449

about 1 month ago

LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly sacrificing accuracy? IceCache is a new method for managing KV caches that leverages Dynamic Continuous Indexing (DCI) to efficiently group and retrieve tokens by semantics. Joint work w/ @Mao_Yuzhen, @q1tong and Martin Ester. For details, check out the links below.

5

214

15

217

21K

about 1 month ago

Project website: https://t.co/nXpL1erf7Z
Paper: https://t.co/KtJUaIFD4c
Time and Location at ICLR 2026: https://t.co/P67bXAMOH7

0

16

0

13

1K

about 1 month ago

@mehranag @Moazeni_Alireza @yszhang170
Project website: https://t.co/RYts9mUH4P
Paper: https://t.co/pyM1Z59joq
Time and Location at ICLR 2026: https://t.co/GesN1nbRYh

0

3

0

1

554

about 1 month ago

Introducing WIMLE, a model-based RL method that substantially improves sample efficiency and asymptotic performance on hard tasks. Rather than assuming a Gaussian world model, WIMLE trains a world model with IMLE. Joint w/ @mehranag, @Moazeni_Alireza, @yszhang170. See 👇 for links. https://t.co/K16hS8FU2m

1

29

7

23

4K

4 months ago

@MingyuanZhou Agreed - normalization as an idea is definitely not new (cf. Sinkhorn iterations). I was merely pointing out that this difference relative to GMMN could explain why GMMN didn't take off.

0

1

0

0

185

4 months ago

To be fair to the authors, I think the normalization of the kernel is key. If the normalization weren't there, the kernel would not depend on other samples. In that case, the drift would be the same regardless of whether (1) all fake samples are far away from a real sample (which is common at the beginning of training), or (2) one fake sample is much closer to a real sample compared to other fake samples (which is common later on in training). One would want the drift to be large in the former case and small in the latter case. But without normalization, there would be no way to make that happen.
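The point about normalization can be made concrete with a small numerical sketch. This is only an illustration, not the formulation from the Drifting Models paper: the exponential kernel, the temperature `tau`, and the choice to normalize over the fake samples are all assumptions made here. It compares the weight given to the same far-away fake sample in the two cases described above.

```python
import numpy as np

def kernel_weights(real, fakes, tau=1.0):
    """Unnormalized weights: each depends only on its own (real, fake) pair."""
    dists = np.linalg.norm(fakes - real, axis=1)
    return np.exp(-dists / tau)

def normalized_weights(real, fakes, tau=1.0):
    """Weights normalized over the fake samples (an assumed normalization),
    so each weight depends on where all the other fake samples are."""
    k = kernel_weights(real, fakes, tau)
    return k / k.sum()

real = np.zeros(2)
# Case (1): every fake sample is far from the real sample.
all_far = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]])
# Case (2): same first two fake samples, but the third is very close.
one_close = np.array([[5.0, 0.0], [0.0, 5.0], [-0.1, 0.0]])

for name, fakes in [("all far", all_far), ("one close", one_close)]:
    # Index 0 is the same far-away fake sample in both cases.
    print(name, kernel_weights(real, fakes)[0], normalized_weights(real, fakes)[0])
```

In this toy setup the unnormalized weight of the first fake sample is identical in both cases, while the normalized weight drops sharply once another fake sample sits close to the real one.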

Ivan Skorokhodov

4 months ago

The recent Drifting Models paper from Kaiming's group got very hyped over the past few days as a new generative modeling paradigm, but in fact, it can actually be seen as a scaled-up/generalized version of the good old GMMN from 2015 (and the authors themselves acknowledge this in the paper in Appendix C.2, noting that GMMN can be seen as Drifting Models for a particular choice of the kernel). Also, I am very skeptical about its scalability (for higher diversity / higher resolution datasets, larger models, and videos).

The way Drifting Models work is actually very simple:
1. Sample random noise z ~ N(0, I)
2. Feed it to the generator and get a fake sample x' = G(z)
3. For each fake sample x', compute its similarity (in the feature space of some encoder) to each of the real samples x_i from the current batch.
4. Push it closer toward these real samples using the similarities as weights (i.e. so that we push to the nearest ones the most).
5. To make sure that we don't have any sort of mode collapse, repel each fake sample from other fake samples via the same scheme.
6. Profit

Now, GMMN follows exactly the same scheme, with the only difference being that it uses a different (unnormalized) function in the "distance computation" and doesn't allow for cleanly plugging in normalization/scaling in the similarity scores or CFG.

Why didn't GMMN take off and why am I skeptical about Drifting Models? The issue is that it makes it much harder to compute any meaningful similarity when your dataset gets more diverse (happens when you switch to foundational T2I/T2V model training), or the batch size gets smaller (happens when your model size or training resolution increases), or your feature encoder produces less comparable representations (happens for videos or more diverse datasets). You can sure get informative similarities for 4096 batch size on the object-centric, limited diversity ImageNet with ResNet-50 feature encoder, but for smth like video generation, we train on hundreds of millions of videos or, at high resolutions + larger model sizes, with a batch size of 1 per GPU (not sure if will be fast to do inter-GPU distance computations).

From the theoretical perspective, even though the final objective and the practical training scheme are the same, the mathematical machinery to formulate the framework is very different and enables direct access to the drifting field (e.g., to easily enable CFG which the authors already did). But I guess what I like the most about this paper is that Kaiming's group is boldly pushing against the mainstream ideas of the community, and hopefully it will inspire others to also take a look at the fundamentals and stop cargo-culting diffusion models.
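The scheme described in the numbered steps can be sketched roughly as follows. This is a toy reading of the post, not the paper's algorithm: the linear "generator" `W`, the identity feature encoder, the softmax similarity, and the temperature `tau` are all hypothetical choices made here for illustration, and the function only computes a drift direction per fake sample rather than training anything.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def drift(fakes, reals, tau=1.0):
    """Attraction toward real samples minus attraction toward other fakes.

    Requires at least two fake samples so the repulsion term is defined.
    """
    # Step 3: pairwise fake-to-real distances, shape (n_fake, n_real).
    d_fr = np.linalg.norm(fakes[:, None, :] - reals[None, :, :], axis=-1)
    # Step 4: similarity weights, largest for the nearest real samples.
    w = softmax(-d_fr / tau, axis=1)
    attract = w @ reals - fakes  # rows of w sum to 1
    # Step 5: same scheme among fake samples, excluding each sample itself.
    d_ff = np.linalg.norm(fakes[:, None, :] - fakes[None, :, :], axis=-1)
    np.fill_diagonal(d_ff, np.inf)
    v = softmax(-d_ff / tau, axis=1)
    repel = v @ fakes - fakes
    return attract - repel

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 4))      # Step 1: noise z ~ N(0, I)
W = rng.standard_normal((4, 2))      # hypothetical linear generator G(z) = z W
fakes = z @ W                        # Step 2: fake samples x' = G(z)
reals = rng.standard_normal((16, 2)) # stand-in batch of real samples
print(drift(fakes, reals).shape)
```

Both pairwise distance matrices scale with the batch size, which is where the concern in the post about small per-GPU batches comes from.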

8

503

46

387

91K

6

98

4

85

26K

4 months ago

To clarify, my post was in response to @isskoro's post, which was on the relationship between drifting models and GMMN and why GMMN didn't take off. I agree with you that the idea of normalizing kernels is not new; I was merely pointing out why this difference could explain why GMMN didn't take off.

0

9

0

3

817

4 months ago

@jon_barron I think they use minibatches, so n is the batch size, but yes, the computational complexity is quadratic. It's indeed possible to use fast nearest neighbor approximations - in fact we did that in IMLE.
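To illustrate the trade-off mentioned here, the sketch below contrasts an exact nearest-neighbor search, which builds the full pairwise distance matrix, with a simple approximation that only inspects candidates whose projections onto one random direction are close to the query's. The projection-based search is a deliberately simplified stand-in chosen for this example; it is not DCI and not the procedure used in IMLE, and the function names and the `half_window` parameter are hypothetical.

```python
import numpy as np

def exact_nn(queries, data):
    """Brute force: full (n_queries, n_data) distance matrix."""
    d = np.linalg.norm(queries[:, None, :] - data[None, :, :], axis=-1)
    return d.argmin(axis=1)

def projected_nn(queries, data, half_window, seed=0):
    """Approximate search: check only points whose 1-D random projection
    lies within a window around the query's projection."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(data.shape[1])
    direction /= np.linalg.norm(direction)
    proj = data @ direction
    order = np.argsort(proj)          # sort the data once by projection
    sorted_proj = proj[order]
    n = len(data)
    out = np.empty(len(queries), dtype=int)
    for i, q in enumerate(queries):
        pos = np.searchsorted(sorted_proj, q @ direction)
        lo, hi = max(0, pos - half_window), min(n, pos + half_window)
        cand = order[lo:hi]           # candidate indices near the query
        out[i] = cand[np.linalg.norm(data[cand] - q, axis=1).argmin()]
    return out

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 8))
queries = rng.standard_normal((5, 8))
print(exact_nn(queries, data), projected_nn(queries, data, half_window=50))
```

With a window that covers the whole dataset the approximation reduces to the exact search; smaller windows trade accuracy for fewer distance computations per query.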

0

2

0

0

341
