Diffusion and flow matching-based robot planners are slow and generate noisy and jerky trajectories.
Delighted to share our ICRA 2026 paper, which leverages IMLE to improve planning frequency 19-fold from 4.3 Hz to 83 Hz and reduces jerk by 38% relative to flow matching.
Joint work w/ Grayson Lee, Minh Bui, Shuzi Zhou, Yankai Li and Mo Chen.
(1/7)
For more details, see:
- Project website: https://t.co/KPjQltVS13
- Paper: https://t.co/eQ0ykIqMSR
- Code: https://t.co/TZS21GAMNg
If you are at ICRA, come check out Poster 301 on Tuesday afternoon!
(7/7)
I've covered the mathematical foundation of IceCache as discussed in the paper, along with parts that weren't detailed there.
IceCache is a novel approach to managing KV caches that uses Dynamic Continuous Indexing (DCI) to organize and retrieve tokens based on their semantic relationships more efficiently.
I walked through the complete sparse-retrieval theory step by step: every formula explained from first principles, every design choice motivated, every minute mathematical detail laid out. The implementation is in the next post, check it out.
https://t.co/pdzo5YX0Ka
Thank you for this wonderful paper, would love any feedback or guidance
@KL_Div @Mao_Yuzhen @q1tong
LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly sacrificing accuracy?
IceCache is a new method for managing KV caches that leverages Dynamic Continuous Indexing (DCI) to efficiently group and retrieve tokens by semantics.
Joint work w/ @Mao_Yuzhen, @q1tong and Martin Ester.
For details, check out the links below.
@mehranag @Moazeni_Alireza @yszhang170 Project website: https://t.co/RYts9mUH4P
Paper: https://t.co/pyM1Z59joq
Time and Location at ICLR 2026: https://t.co/GesN1nbRYh
Introducing WIMLE, a model-based RL method that substantially improves sample efficiency and asymptotic performance on hard tasks. Rather than assuming a Gaussian world model, WIMLE trains a world model with IMLE.
Joint w/ @mehranag, @Moazeni_Alireza, @yszhang170.
See the thread for links.
@MingyuanZhou Agreed - normalization as an idea is definitely not new (cf. Sinkhorn iterations). I was merely pointing out that this difference relative to GMMN could explain why GMMN didn't take off.
To be fair to the authors, I think the normalization of the kernel is key.
If the normalization weren't there, the kernel would not depend on other samples. In that case, the drift would be the same regardless of whether (1) all fake samples are far away from a real sample (which is common at the beginning of training), or (2) one fake sample is much closer to a real sample compared to other fake samples (which is common later on in training).
One would want the drift to be large in the former case and small in the latter case. But without normalization, there would be no way to make that happen.
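To make the two cases concrete, here is a toy sketch of the argument. It assumes an exponential kernel exp(-d / tau) and normalization across the fake samples for a fixed real sample; the kernel form, `tau`, and the distances are all illustrative choices, not taken from the paper.

```python
import math

def kernel(dist, tau=1.0):
    # Assumed kernel: similarity decays exponentially with distance.
    return math.exp(-dist / tau)

def weights_for_real_sample(fake_dists, tau=1.0, normalize=True):
    # fake_dists[j] is the distance from one real sample to fake sample j.
    raw = [kernel(d, tau) for d in fake_dists]
    if not normalize:
        return raw
    total = sum(raw)
    return [r / total for r in raw]

# Case 1: every fake sample is far from the real sample.
case_all_far = [5.0, 5.0, 5.0]
# Case 2: fake sample 1 is much closer than the others.
case_one_close = [5.0, 0.5, 5.0]

# Weight on fake sample 0, which sits at distance 5.0 in both cases.
raw_far = weights_for_real_sample(case_all_far, normalize=False)[0]
raw_close = weights_for_real_sample(case_one_close, normalize=False)[0]
norm_far = weights_for_real_sample(case_all_far)[0]
norm_close = weights_for_real_sample(case_one_close)[0]

print("unnormalized:", raw_far, raw_close)
print("normalized:  ", norm_far, norm_close)
```

Without normalization, fake sample 0 gets the same weight in both cases. With normalization, its weight is large when no fake sample is close and shrinks once another fake sample already sits near the real one.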
The recent Drifting Models paper from Kaiming's group got very hyped over the past few days as a new generative modeling paradigm, but it can in fact be seen as a scaled-up/generalized version of the good old GMMN from 2015 (and the authors themselves acknowledge this in Appendix C.2 of the paper, noting that GMMN can be seen as Drifting Models for a particular choice of the kernel). Also, I am very skeptical about its scalability (for higher diversity / higher resolution datasets, larger models, and videos).
The way Drifting Models work is actually very simple:
- 1. Sample random noise z ~ N(0, I)
- 2. Feed it to the generator and get a fake sample x' = G(z)
- 3. For each fake sample x', compute its similarity (in the feature space of some encoder) to each of the real samples x_i from the current batch.
- 4. Push it closer toward these real samples using the similarities as weights (i.e. so that we push to the nearest ones the most).
- 5. To make sure that we don't have any sort of mode collapse, repel each fake sample from other fake samples via the same scheme.
- 6. Profit
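The steps above can be sketched as a toy 1D simulation. This is only an illustration under loud assumptions: there is no generator or feature encoder (the fake samples are moved directly and stand in for G(z)), the similarity is a softmax over negative absolute distance, and `tau`, `step`, and the sample counts are made-up values. It is not the paper's actual formulation.

```python
import math
import random

def softmax_weights(x, others, tau):
    # Similarity of x to each point in `others`, normalized to sum to 1.
    logits = [-abs(x - o) / tau for o in others]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_pull(x, others, tau):
    # Similarity-weighted average displacement from x toward `others`.
    w = softmax_weights(x, others, tau)
    return sum(wi * (o - x) for wi, o in zip(w, others))

def drift(fake, real, tau=1.0):
    drifts = []
    for i, x in enumerate(fake):
        attract = weighted_pull(x, real, tau)           # step 4: pull toward real samples
        other_fakes = fake[:i] + fake[i + 1:]
        repel = weighted_pull(x, other_fakes, tau) if other_fakes else 0.0
        drifts.append(attract - repel)                  # step 5: push away from other fakes
    return drifts

def mean(xs):
    return sum(xs) / len(xs)

random.seed(0)
real = [random.gauss(3.0, 0.5) for _ in range(32)]  # "real" batch
fake = [random.gauss(0.0, 1.0) for _ in range(32)]  # stand-in for x' = G(z)

gap_before = abs(mean(fake) - mean(real))
step = 0.1
for _ in range(100):
    d = drift(fake, real)
    fake = [x + step * dx for x, dx in zip(fake, d)]
gap_after = abs(mean(fake) - mean(real))

print("gap between means before/after:", gap_before, gap_after)
```

In an actual model the drift would presumably serve as a training signal for the generator rather than being applied to the samples directly.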
Now, GMMN follows exactly the same scheme, with the only difference being that it uses a different (unnormalized) function in the "distance computation" and doesn't allow for cleanly plugging in normalization/scaling in the similarity scores or CFG.
Why didn't GMMN take off and why am I skeptical about Drifting Models? The issue is that it becomes much harder to compute any meaningful similarity when your dataset gets more diverse (happens when you switch to foundational T2I/T2V model training), or the batch size gets smaller (happens when your model size or training resolution increases), or your feature encoder produces less comparable representations (happens for videos or more diverse datasets). You can certainly get informative similarities with a batch size of 4096 on the object-centric, limited-diversity ImageNet with a ResNet-50 feature encoder, but for something like video generation, we train on hundreds of millions of videos or, at high resolutions + larger model sizes, with a batch size of 1 per GPU (not sure if it will be fast to do inter-GPU distance computations).
From the theoretical perspective, even though the final objective and the practical training scheme are the same, the mathematical machinery to formulate the framework is very different and enables direct access to the drifting field (e.g., to easily enable CFG which the authors already did). But I guess what I like the most about this paper is that Kaiming's group is boldly pushing against the mainstream ideas of the community, and hopefully it will inspire others to also take a look at the fundamentals and stop cargo-culting diffusion models.
To clarify, my post was in response to @isskoro's post, which was on the relationship between drifting models and GMMN and why GMMN didn't take off. I agree with you that the idea of normalizing kernels is not new; I was merely pointing out that this difference could explain why GMMN didn't take off.
@jon_barron I think they use minibatches, so n is the batch size, but yes, the computational complexity is quadratic. It's indeed possible to use fast nearest neighbor approximations - in fact we did that in IMLE.
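A rough sketch of why nearest neighbors can help here: when similarity decays quickly with distance, most of the normalized weight sits on the closest few samples, so the weights can be computed over only the k nearest. The kernel, `tau`, `k`, and the data below are illustrative assumptions. The brute-force `heapq.nsmallest` scan is only a stand-in, and it still touches every sample; an approximate nearest neighbor index would replace that scan to get an actual speedup.

```python
import heapq
import math

def full_weights(x, batch, tau=1.0):
    # Normalized weights over the whole batch: n kernel evaluations per query.
    raw = [math.exp(-abs(x - b) / tau) for b in batch]
    total = sum(raw)
    return [r / total for r in raw]

def knn_weights(x, batch, k, tau=1.0):
    # Keep only the k nearest samples and normalize over them.
    # The nearest-neighbor search here is brute force, purely for illustration.
    nearest = heapq.nsmallest(k, range(len(batch)), key=lambda j: abs(x - batch[j]))
    raw = {j: math.exp(-abs(x - batch[j]) / tau) for j in nearest}
    total = sum(raw.values())
    return {j: r / total for j, r in raw.items()}

batch = [0.0, 0.1, 5.0, 9.0]
w_full = full_weights(0.0, batch, tau=0.5)
w_knn = knn_weights(0.0, batch, k=2, tau=0.5)

print("full:", w_full)
print("k=2: ", w_knn)
```

In this example the two far-away samples carry almost no weight, so dropping them barely changes the weights on the two nearest ones.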