Variational Walkback: Learning a Transition Operator as a Stochastic Recurrent Net
https://t.co/RnUU3JYpST
Start at a real data point, deliberately walk away from it by applying the model with increasing noise, then train the same transition operator to walk back toward the data.
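A rough 1-D sketch of that walk-away / walk-back idea, not the paper's implementation. Everything here is an assumption made for illustration: the operator is a hypothetical Gaussian whose mean pulls toward a single learnable `center`, the noise temperature doubles each step, and "training" is plain gradient descent on squared error (which matches a fixed-variance Gaussian log-likelihood up to scale).

```python
import random

def transition_mean(x, center, rate=0.5):
    # Hypothetical operator mean: move x part of the way toward `center`.
    return x + rate * (center - x)

def walk_away(x0, center, n_steps=5, base_sigma=0.1, temp_growth=2.0, rng=None):
    # Apply the operator repeatedly, raising the noise temperature each step,
    # so the chain drifts away from the starting data point.
    rng = rng or random.Random(0)
    traj, temps = [x0], []
    temp = 1.0
    for _ in range(n_steps):
        x = traj[-1]
        traj.append(rng.gauss(transition_mean(x, center, ), base_sigma * temp))
        temps.append(temp)
        temp *= temp_growth
    return traj, temps

def walkback_pairs(traj):
    # Training pairs for the reverse direction: from the noisier state
    # traj[t + 1] the operator should step back to traj[t].
    return [(traj[t + 1], traj[t]) for t in range(len(traj) - 1)]

def fit_center(pairs, center=0.0, rate=0.5, lr=0.1, epochs=400):
    # Gradient descent on the mean squared error between the operator's
    # predicted mean and the walk-back target.
    for _ in range(epochs):
        grad = 0.0
        for src, tgt in pairs:
            err = transition_mean(src, center, rate) - tgt
            grad += 2.0 * err * rate  # d(mean)/d(center) = rate
        center -= lr * grad / len(pairs)
    return center

traj, temps = walk_away(x0=2.0, center=0.0)
pairs = walkback_pairs(traj)
learned_center = fit_center(pairs)
```

Reusing one operator for both directions is the point of the sketch: the same `transition_mean` generates the outward walk and is then fitted on the reversed pairs.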
🎉 "CatFlow: Co-generation of Slab-Adsorbate Systems via Flow Matching" has been accepted at #ICML2026!
We develop flow matching for catalyst design at the all-atom level with a factorized representation.
Thanks to @__na__young__ , @honghui_kim , and @sungsoo_ahn_
We are looking for talented people interested in AI for Science, including ML for molecules, materials, and scientific discovery.
If you are interested, please feel free to DM or email me. I am happy to chat and answer any questions.
A 10-million-parameter model just outperformed deterministic rivals 3 times its size by doing something regular recursive AI models don't do: exploring multiple reasoning paths at the same time.
Most AI reasoning models are trapped on a single train of thought, and GRAM ("Generative Recursive Reasoning") is the first to break that by letting the model think in parallel universes simultaneously.
The problem is that all existing recursive models are fully deterministic, meaning given the same input they always follow the exact same reasoning path and can never escape a wrong trajectory or discover more than 1 valid answer.
GRAM fixes this by injecting learned randomness at each refinement step, so the model samples a slightly different direction each time rather than snapping to 1 fixed next state, which produces a spread of diverse reasoning trajectories.
At test time the model runs many of these paths in parallel and selects the best one using a small reward predictor trained alongside the main model, adding a "width" scaling axis on top of the usual "depth" axis of running more recursion steps.
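A toy sketch of the "width" axis described above, under loud assumptions: the state is a single float, `refine` is a made-up update with fixed Gaussian noise (the paper learns its randomness), and `score` stands in for the small reward predictor. None of these names come from the paper.

```python
import random

def refine(state, rng, noise_scale=0.3):
    # Hypothetical stochastic refinement step: a deterministic update
    # plus sampled noise, so repeated runs follow different paths.
    return 0.5 * state + rng.gauss(0.0, noise_scale)

def sample_trajectory(init, depth, rng):
    # "Depth" axis: apply the refinement step `depth` times.
    state = init
    for _ in range(depth):
        state = refine(state, rng)
    return state

def best_of_width(init, depth, width, score, seed=0):
    # "Width" axis: sample `width` independent trajectories and keep the
    # one the (stand-in) reward predictor scores highest.
    rng = random.Random(seed)
    candidates = [sample_trajectory(init, depth, rng) for _ in range(width)]
    return max(candidates, key=score), candidates

# Stand-in reward predictor: prefers final states close to 0.2.
score = lambda s: -abs(s - 0.2)
best, candidates = best_of_width(init=1.0, depth=8, width=20, score=score)
```

With `width=1` this reduces to a single sampled path; raising `width` can only keep or improve the selected score for a fixed seed, since the earlier candidates are still in the pool.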
On hard Sudoku puzzles, GRAM with 10M parameters hits 97% accuracy versus 87.4% for the best prior recursive model, and with only 20 parallel samples it outperforms every deterministic baseline even at 320 recursion steps.
On tasks with many valid answers like N-Queens, deterministic recursive models collapse as the number of solutions grows, while GRAM maintains near-perfect accuracy throughout.
The same stochastic framework also acts as a generator: given a blank board, GRAM produces valid Sudoku puzzles 99% of the time using 16 steps, versus 1,000 steps and 55M parameters for the best diffusion baseline at just 91%.
---
Paper Link – arxiv.org/abs/2605.19376v1
🧠We introduce "Generative Recursive Reasoning"!
Recursive Reasoning Models like HRM, TRM, and Looped Transformers are deterministic — same input, same reasoning, every time. They collapse the entire space of plausible reasoning paths into a single attractor.
Our model GRAM (Generative Recursive reAsoning Models) turns recursion itself into a stochastic latent trajectory. Multiple hypotheses, alternative solution strategies, and inference-time scaling not just by depth, but by width — parallel trajectory sampling.
And here's the kicker: the same formulation that gives us conditional reasoning p(y|x) also makes GRAM a general generative model p(x).
With only 10M params:
• Sudoku-Extreme: 97.0% (TRM 87.4%)
• ARC-AGI-1: 52.0%
• ARC-AGI-2: 11.1%
• N-Queens coverage: 90%+
📄 Paper: https://t.co/JC7EyXYc9Y
🌐 Project page: https://t.co/LRT1dQiWLZ
w/
Junyeob Baek @JunyeobB (KAIST),
Mingyu Jo @pyross0000 (KAIST),
Minsu Kim @minsuuukim (KAIST & Mila),
Mengye Ren @mengyer (NYU),
Yoshua Bengio @Yoshua_Bengio (Mila),
Sungjin Ahn @SungjinAhn_ (KAIST)
KAIST AI (College of AI) is hiring!
If you are attending ICML 2026 in Seoul and are interested in faculty or postdoc positions at KAIST AI Computing (and CS), feel free to reach out by filling out this short interest form:
https://t.co/PBAkgFXaJT
We are looking for researchers across broad areas of AI and Computer Science, including ML, NLP, CV, HCI, Systems and more.
Please share with anyone who may be interested!
This Google/Cambridge ICLR 2026 paper introduces Visual Planning, a novel reinforcement learning framework (VPRL) that enables purely visual reasoning through sequences of images, outperforming text-only planning in visual navigation tasks and offering a promising supplement to language-based reasoning for "vision-first" challenges.
Read with an AI tutor: https://t.co/WIiYI4uMg0
PDF: https://t.co/GAOFKXj2qT
The scientific process involves collecting informative measurements while effectively allocating limited resources. We developed MaD-Physics, a new benchmark to measure this capability of agents.
Do we really need to hard-code synthesis routes into the generative process to obtain synthesizable molecules?
Our ICML 2026 paper suggests another route.
Huge credit to @HyeonahKimm , @alexhdezgcia , Celine Roget, Dionessa Biton, Louis Vaillancourt, Yves V. Brun, @Yoshua_Bengio
In S3-GFN, we keep the molecular generator sequence-based, initialize it from a rich SMILES prior, and induce synthesizability through soft distributional post-training.
Rather than treating synthesizability as a hard action-space constraint or simply folding it into scalar reward shaping, we maintain positive/negative replay buffers and use a contrastive auxiliary loss to separate synthesizable and unsynthesizable regions in probability space.
This gives a simple but flexible way to steer GFlowNet sampling toward high-reward, synthesizable molecules while retaining the benefits of pretrained chemical language models.
(1/4)
Excited to share that our paper “Active Attacks: Red-teaming LLMs via Adaptive Environments” has been accepted to ICML 2026.
Joint work with Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, and @Yoshua_Bengio .
We study automated red-teaming: training an attacker LLM to generate diverse attack prompts that expose failure modes in a victim LLM, then using those prompts to improve safety tuning.
Can stronger adaptive attacks make LLMs safer?
More below 🧵
3/3
Technically, we combine adaptive environments with soft/off-policy RL and replay training.
The result is broader coverage of harmful modes and stronger safety tuning: cross-attack success vs GFlowNet baselines improves from 0.07% to 31.28%, with only ~6% extra compute.
Paper: https://t.co/NDIFVTyqTW
2/3
Active Attacks makes the red-teaming environment adaptive.
After each round, we safety-finetune the victim on discovered attacks. This lowers reward in already-exploited regions, pushing the attacker to search for new vulnerabilities.
This creates an active-learning-like, easy-to-hard curriculum.
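A toy illustration of why an adaptive environment pushes the attacker toward new modes. This is only a sketch under assumptions: the attack "modes", their reward values, the greedy attacker (standing in for an RL policy), and `patch_factor` (standing in for safety fine-tuning) are all hypothetical.

```python
def run_rounds(vulnerability, n_rounds, patch_factor=0.1):
    # `vulnerability` maps an attack mode to the reward it currently earns.
    reward = dict(vulnerability)  # copy so the caller's dict is untouched
    discovered = []
    for _ in range(n_rounds):
        # Greedy stand-in for the attacker: pick the highest-reward mode.
        mode = max(reward, key=reward.get)
        discovered.append(mode)
        # Stand-in for safety fine-tuning the victim on the found attack:
        # the reward in that already-exploited region drops.
        reward[mode] *= patch_factor
    return discovered, reward

modes = {"easy": 0.9, "medium": 0.6, "hard": 0.3}

adaptive, _ = run_rounds(modes, n_rounds=3)                  # victim is patched each round
static, _ = run_rounds(modes, n_rounds=3, patch_factor=1.0)  # victim never changes

print(adaptive)  # ['easy', 'medium', 'hard']
print(static)    # ['easy', 'easy', 'easy']
```

In the static setting the attacker keeps exploiting the same easy mode; once exploited regions lose reward, the order of discovery becomes an easy-to-hard curriculum.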
1/3
A key challenge in RL-based red-teaming is diversity.
If an attacker finds a few easy high-reward attack modes, it can keep exploiting them. This may give high attack success, but poor coverage of failure modes — exactly what we do not want for robust safety tuning.
Empirically, this simple recipe works surprisingly well:
S3-GFN achieves high synthesizability (often >95%) while maintaining strong optimization performance across molecular design tasks.
So the takeaway is:
you may not need to hard-code synthesis routes to generate synthesizable molecules — a strong sequence prior + soft constrained post-training can already go a long way.
(4/4)
Our approach, S3-GFN, keeps the generator sequence-based (SMILES) and starts from a rich chemical prior.
Instead of baking synthesizability into the MDP as a hard constraint, we induce it through soft distributional post-training.
The core mechanism is simple:
- maintain positive / negative replay buffers
- add a contrastive auxiliary loss
- suppress unsynthesizable regions while preserving reward-seeking behavior
(3/4)
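One way the three-step mechanism above could look in code, as a sketch only. The thread does not give the exact loss, so the pairwise logistic form below is an assumption, as are the names `contrastive_aux_loss`, `total_loss`, and `aux_weight`. Inputs are the model's log-probabilities for sequences drawn from the positive (synthesizable) and negative (unsynthesizable) buffers.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def contrastive_aux_loss(pos_logps, neg_logps):
    # Assumed pairwise form: -log sigmoid(logp_pos - logp_neg), averaged
    # over all positive/negative pairs. It shrinks as the model assigns
    # more probability to positives than to negatives.
    pairs = [(p, n) for p in pos_logps for n in neg_logps]
    return sum(softplus(n - p) for p, n in pairs) / len(pairs)

def total_loss(gfn_loss, pos_logps, neg_logps, aux_weight=0.1):
    # The reward-seeking GFlowNet objective is kept; the contrastive term
    # is added on top with a hypothetical weight.
    return gfn_loss + aux_weight * contrastive_aux_loss(pos_logps, neg_logps)

# Log-probs of buffered sequences under the current model (made-up numbers).
pos = [-2.0, -2.5]   # synthesizable samples
neg = [-6.0, -7.0]   # unsynthesizable samples
loss = total_loss(gfn_loss=1.3, pos_logps=pos, neg_logps=neg)
```

Because the auxiliary term acts on probabilities rather than on the action space, the constraint stays soft: unsynthesizable regions are pushed down without being made unreachable.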