Excited to share that our lab will present two Orals at the ICLR SPOT workshop this Monday:
• Maximum Likelihood Reinforcement Learning (10:10–10:20) — 🏆 Best Paper Award
• Expanding the Capabilities of Reinforcement Learning via Text Feedback (10:20–10:30) — Oral + 🏆 Outstanding Paper Award at LLA Workshop
Come and say hi!
Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective.
Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings.
Paper + code + project website: https://t.co/j9BCBF7K3R
🧵 1/n
Huge congratulations to Dr. 𝐇𝐚𝐫𝐬𝐡 on receiving the 2025 𝐈𝐊𝐃𝐃 𝐃𝐨𝐜𝐭𝐨𝐫𝐚𝐥 𝐃𝐢𝐬𝐬𝐞𝐫𝐭𝐚𝐭𝐢𝐨𝐧 𝐀𝐰𝐚𝐫𝐝! 🏆
We are incredibly proud to celebrate a 𝐡𝐚𝐭-𝐭𝐫𝐢𝐜𝐤 of successes for VAL:
🔹 2025: Harsh (Winner)
🔹 2024: Sravanti (Winner)
🔹 2023: Jogendra (Runner-Up)
Excited to share our NeurIPS paper: “Improving Model-Based Reinforcement Learning by Converging to Flatter Minima”. 🚀
TL;DR: make world-model training seek flatter minima and you get more robust model-based RL, with big gains on challenging benchmarks. 1/n
The method is simple to use:
• no architectural changes
• small compute overhead (one extra SAM-style step)
• works across pixel + state inputs and very different planners.
If you already train a world model, this is nearly plug-and-play. 9/n
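For readers who have not used sharpness-aware minimization (SAM), here is a minimal sketch of what one "SAM-style step" generally looks like. It is generic SAM on a plain list of parameters, not the paper's implementation; `sam_step`, `grad_fn`, `lr`, and `rho` are hypothetical names chosen for illustration.

```python
import math
from typing import Callable, List


def sam_step(
    params: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    lr: float = 0.1,
    rho: float = 0.05,
) -> List[float]:
    """One generic SAM update (illustrative; not the paper's code).

    grad_fn is a hypothetical callable returning the loss gradient
    at the given parameters.
    """
    # 1) Ordinary gradient at the current parameters.
    g = grad_fn(params)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction to perturb along.
        return list(params)

    # 2) Extra step: move to the approximate worst-case point
    #    within an L2 ball of radius rho.
    perturbed = [p + rho * x / norm for p, x in zip(params, g)]

    # 3) Descend from the ORIGINAL parameters using the gradient
    #    evaluated at the perturbed point.
    g_sam = grad_fn(perturbed)
    return [p - lr * x for p, x in zip(params, g_sam)]


# Toy usage: loss(w) = 0.5 * w^2, so the gradient is w itself.
new_params = sam_step([1.0], lambda w: list(w), lr=0.5, rho=0.1)
```

The only extra cost over a normal update is the second gradient evaluation, which is why this kind of step adds a small, fixed compute overhead.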
I will be at #NeurIPS2025 (Dec 1–7)! 📷 Would love to connect and chat about model-based RL, policy robustness, methods beyond policy gradients, and their implications for LLMs.
I am actively seeking PhD positions in the aforementioned areas.
[0/3]
🚀 Introducing Verlog – an open-source RL framework built specifically for training long-horizon, multi-turn LLM agents.
📊 Max episode length comparison:
• VeRL / RAGEN → ~10 turns
• verl-agent → ~50 turns
• Verlog (ours) → 400+ turns 🔥
⚙️ Technical foundation:
• Built on top of VeRL
• Tested on the BALROG benchmark (BabyAI, BabaIsAI, Crafter)
• Follows design principles from pytorch-a2c-ppo-acktr-gail
💡 Why Verlog?
• For researchers: Skip the heavy engineering. We give you a strong, validated baseline for long-horizon, multi-turn LLM agents across diverse tasks.
• For developers: Train on your own long-horizon environments with minimal setup.
• Algorithmic edge: With a well-trained value function as an intermediate supervision signal, rollouts can be truncated at any point and still be used for learning. This reduces GPU idle time and boosts training efficiency. It is a genuine advantage of PPO over the GRPO family, widely recognized and leveraged in classic RL, yet often overlooked in LLM agent frameworks.
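To make the truncation point concrete, here is a minimal sketch of how a value function lets a cut-off rollout still produce advantages, using standard generalized advantage estimation (GAE) with a bootstrap value. This is a generic illustration, not Verlog's actual code; `gae_advantages` and its parameters are hypothetical names.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generic GAE over a possibly truncated rollout (illustrative only).

    rewards[t] and values[t] cover steps 0..T-1 of the collected segment.
    bootstrap_value is the critic's estimate V(s_T) at the cut point;
    pass 0.0 only if the episode genuinely terminated there.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: the critic stands in for the unseen future return.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# A 2-step segment cut off before any reward arrived still yields a
# learning signal because V(s_T) is bootstrapped from the critic.
adv = gae_advantages([0.0, 0.0], [0.5, 0.5], bootstrap_value=1.0,
                     gamma=1.0, lam=1.0)
```

Methods that score only complete trajectories have nothing comparable to `bootstrap_value`, so they typically need the rollout to finish before it can be used.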
Key features 🧵👇
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers?
Introducing Self-Rewarding Training (SRT), where language models provide their own rewards for RL training!
🧵 1/n
Vision and AI Lab (VAL), IISc has been recognized as the top AI lab in India by @CSrankings 🥇🎉, reflecting a decade of dedicated research. IISc is also ranked #1 in AI research nationwide 🥇. Thanks to our amazing team for their hard work and commitment 🙏 #AI #CV #ML #IISc