I'm excited and grateful to share our ICML spotlight paper "Accelerating Q-learning through Efficient Value-Sharing across Actions". It introduces the "mean-expansion layer", a very simple parameter-free layer that accelerates action-value learning and improves performance! (1/6)
@_Suresh2 No, actually. There is an implicit assumption in this method that we are applying this to discrete-action Q-networks, where the output is a vector of action-values. One forward pass already outputs a value for every action, and we just add a layer after that.
I'm excited and grateful to share our ICML spotlight paper "Accelerating Q-learning through Efficient Value-Sharing across Actions". It introduces the "mean-expansion layer", a very simple parameter-free layer that accelerates action-value learning and improves performance! (1/6)
@mic_nau Thanks for your interest! Unfortunately, at the moment, no. @Sisyfuzz and I are working on generalizing this to continuous actions right now. I’ll be sure to keep you posted on that.
(6/6)
Details
- Paper: https://t.co/Gf8dA6nFnP (Appendix D if you just want the PyTorch code)
- Talk 🍿: https://t.co/KdYVJt30Gu
- ICML Poster at Wed, Jul 8, 2026 • 2:30 PM – 4:15 PM in HALL A #404
(5/6) Big thanks to my collaborators, Brett Daley, @white_martha , and @MarlosCMachado.
I'm also grateful that our paper won the Best paper runner-up at the Adaptive Learning Agents workshop at AAMAS.
Try out the mean-expansion layer and let us know how it goes!
We propose a simple parameter-free layer, called the mean-expansion (ME) layer, that can be applied to the end of a Q-network to accelerate action-value learning in deep RL. The ME layer improves DQN and IQN's performance across 57 games without changing the algorithms or hypers!
Friends at #AAMAS, tomorrow at 11 AM at the ALA workshop, I will be giving an early presentation of our #ICML2026 Spotlight, "Accelerating Q-learning through Efficient Value-sharing across Actions". Done in collaboration with Brett Daley, @white_martha, and @MarlosCMachado.
Our paper, “The Cell Must Go On! AgarCL as an Evaluation Platform for Continual RL“, has been accepted at @RL_Conference'26!
This work was led by @Mohamed15069 as a master's student at @UAlberta@AmiiThinks.
Preprint: https://t.co/e6u3SNi7iY
Blog post: https://t.co/yN3eFiwULw
A couple of months ago, we released a preprint of one of my favourite papers I’ve ever written. It lies at the intersection of representation learning and neuroscience. I have now written a blog post about it.
Preprint: https://t.co/vtDeBzvjsq
Blog post: https://t.co/d5rPZHoGaC
A key claim here https://t.co/2WDyQ5PIQc is that next token prediction has no inherent preference for a heliocentric Copernicus theory https://t.co/WS6cIEfUy1 over a geocentric Ptolemy https://t.co/XJHIpm0Pdj theory of observations.
Predicting the next latent fixes that.
Here is my response about RL comment after watching the podcast of @dwarkesh_sp interviewing @karpathy and his comment on RL: https://t.co/OA0nwALSCw
RL is a powerful idea, but let's be thoughtful about when we actually need it.
RL has an incredible 120+ year journey in animal learning—starting with Edward Thorndike's puzzle box experiments in 1898, through decades of psychology research on trial-and-error learning, culminating in 1997 discovery by Schultz, Dayan, and Montague that dopamine neurons literally implement temporal difference learning. That's real RL: agent -environments experiential interaction, and actual learning through experience.
But what we're calling "RL" in LLMs today? It's mostly optimization with reward-weighted gradients on offline data. There's no agent exploring an environment, no real interaction loop. It's just: data in, neural net processes it, and objective function for the output is weighted by reward signals during final RL training stage after intermediate training steps.
Here's the thing—RL fundamentally addresses problems we'd solve with dynamic programming, except the state space is astronomically large ( easily more states than atoms in the universe--a 19x19 Computer Go roughly has 10^170 states!) and we don't have a model of the world. So we use RL as an approximation learning technique. It's essential for these kinds of problems, but not every problem is like this.
When you have interactive, experiential tasks—like learning to generate code to build a software that gets runtime feedback from compute environment given a goal—RL makes perfect sense for LLM-based methods. But the current RL in textbooks or research papers are not enough-- it needs temporal abstractions over sequence of actions--aka generated tokens in LLM world. It is impossible to learn optimal policies for micro-actions without temporal abstractions unless we assume we have infinite data and compute resource--remember it is an NP-hard combinatorial problem!
I think as we move into the era of experiential AI—agents that actually interact with environments and receive runtime feedback—the role of RL will become clearer and more necessary.
@Ulmo_Space@RichardSSutton I agree! I suppose it depends what we mean by knowledge-seeking. We have plenty pattern-recognition machines without the capacity for self-verification. The question is whether these pattern recognition machines have knowledge.
Have people seen this prescient 2001 post by @RichardSSutton on self-verification?
"An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself".
This sentiment underpins much LLM reasoning research today.
https://t.co/gK3DOwqYm8
Can the knowledge in language model representations guide the search for novel behaviors? We find that exploration with a simple, principled, representation-based bonus improves diversity and pass@k rates for inference-time and post-training!
Jane Goodall had a remarkable ability to inspire us to connect with the natural wonders of our world, and her groundbreaking work on primates and the importance of conservation opened doors for generations of women in science. Michelle and I are thinking of all those who loved and admired her.
@scaling01 I disagree. I think Dwarkesh is using imitation in the ML sense. I.e., the agent is given the correct answer to directly imitate. I think Rich, who is well-versed in behavioral psychology, is saying that imitation learning in animals occurs through experience.