Last week, I wrapped up my internship @cohere, where I had the chance to work with fantastic people on RL for LLMs.
It was an amazing 6 months, and I'm excited to share one of the outcomes: ShiQ, a Q-value-based RL algorithm for fine-tuning LLMs 🚀
🧵Details in @irombie's post!
I'm excited to share our new pre-print
ShiQ: Bringing back Bellman to LLMs!
https://t.co/yWMT6M0nuT
In this work, we propose a new Q-learning-inspired RL algorithm for fine-tuning LLMs 🎉
(1/n)
🚀 Excited to share the 3rd outcome of my internship at @CohereAI: a new RL algo for agentic LLMs that combines policy learning and world modeling, letting agents verify actions before executing them.
Check out the 🧵 and 📄!
Big thanks to my co-authors and Cohere’s RL team 🙏
📢 After months of work, I can finally share our latest research. I couldn’t be more thrilled! 🎉
We unify a policy 🤖 and a world model 🌍 into a single LLM, so no external dynamics model is needed!
Why does this matter? Because now, the policy can plan based on its internal world model!
And this planning boosts tool-use success rates to >90%, on top of SFT + RL.
📄: https://t.co/5z72BwWnGT
🧵[1/8]
Excited to share the technical report on Command R7B (7B) and Command A (111B), our flagship model! These models are the result of incredible teamwork at @cohere, and it was an honor to be part of it.
Report: https://t.co/0pOyajfQbe
Today (two weeks after model launch 🔥) we're releasing a technical report on how we made Command A and R7B 🚀! It has detailed breakdowns of our training process and per-capability evaluations (tools, multilingual, code, reasoning, safety, enterprise, long context) 🧵 1/3.
📢 Deadline Extended! 📢
Due to multiple requests and the overlap with @RL_Conference and @RealAAAI, we’re extending the Adaptive Learning Agent workshop @AAMASconf submission deadline to March 1st (AOE)! 🚀
🔗 More details: https://t.co/Qz1XDw0TdU
🚨 Less than 48 hours left to submit to the 17th Adaptive Learning Agent workshop at @AAMASconf! 🚨
We welcome full papers, work in progress, and 2-page abstracts of recent journal papers. Don't miss the deadline!
🔗 More details: https://t.co/Qz1XDw0TdU
Exciting news! My paper on multi-objective reinforcement learning was accepted at AAMAS 2025!
We introduce IPRO (Iterated Pareto Referent Optimisation)—a principled approach to solving multi-objective problems.
🔗 Paper: https://t.co/U8Sx6B0q5A
💻 Code: https://t.co/Umf6oQXJBH
Still 8 days to submit your work to the ALA workshop at AAMAS! We welcome full papers, work in progress, and 2-page abstracts of recently published journal papers. All the info is available at https://t.co/wVu1Wp4uTX.
Excited to announce the 17th Adaptive Learning Agent workshop at @AAMASconf in May! We welcome full papers, work in progress, and 2-page abstracts of recently published journal papers. Find out more at our website: https://t.co/v672Oi2b3J. Deadline for submissions: February 4th.
Two weeks ago, I publicly defended my PhD thesis, entitled « Activating Formal Verification of Deep Reinforcement Learning Policies by Model Checking Bisimilar Latent Space Models ».
📚 The full dissertation is available here: https://t.co/Yvgjzvt31t
(1/n)
I also had the pleasure of presenting our latest work on Online Planning for POMDPs with State Requests (with E. Bargiacchi, A. Nowé, @DiederikRo, @faoliehoek). Check the paper here: https://t.co/FIN6Gn6U9P 2/3
Okay people, I need some help. We’re working on a project and have been stuck for a while. My best guess at the issue is that gradients are not flowing the way we want them to. Does anyone have an intuitive visualisation/debugging tool for gradient flows in JAX?
Presenting work on synthetic preference generation at two #ICLR2024 workshops today: DPFM & GenAI4DM @genai4dm.
Come say hi to find out how to improve your reward model without collecting additional human feedback!
In clinical early warning systems (EWS), can we go beyond the model estimate of event occurrence and leverage its belief about the event distance to improve our alarm policy?
Introducing “Dynamic Survival Analysis for Early Event Prediction” with @ToManuelBurger and @gxr. 🧶
Arrived at #ICLR2024 with @f_delgrange to present our work "The Wasserstein Believer: Learning Belief Updates for Partially Observable MDPs through Reliable Latent Space Models".