🚀 How can we make LLM-based optimization stable and scalable when the feedback signal is stochastic?
Introducing POLCA: a framework for robust, scalable stochastic generative optimization.
Paper: https://t.co/xgdjISRxtE
Code: https://t.co/9TRuyvxVcf
🧵👇 1/
11. When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
🔑 Keywords: Offline reinforcement learning, Outcome supervision, Pessimistic actor-critic, OPAC, Sample efficiency
💡 Category: Reinforcement Learning
🌟 Research Objective:
- To develop a statistical theory for policy optimization from trajectory-level outcome supervision in offline reinforcement learning, using a pessimistic actor-critic approach.
🛠️ Research Methods:
- Introduced the OPAC algorithm, which uses a latent reward model to optimize the policy from trajectory-level labels, and extended the method to preference-based feedback while preserving its statistical guarantees.
💬 Research Conclusions:
- Identified when outcome-level supervision is sample-efficient for offline control, established conditions under which generalized outcome-based offline RL remains tractable, and highlighted fundamental statistical barriers that arise when process-level rewards are missing.
👉 Paper link: https://t.co/QQlddw7uym
Looking for a Google Research student researcher (PhD student) to work on LLM- and agent-related learning.
Preferred background: RL/game theory, agentic systems, LLM training.
The candidate will work closely with me and @allenainie.
Email me if you are interested. 😀
Hiring a student researcher for RL agents, co-hosted by @chinganc_rl and me at Google Research and DeepMind.
Our work in the last 2 years:
https://t.co/xSuqW40wai
https://t.co/aPEK8jQczC
https://t.co/LxkvCzNlEW
Any interest? DMs are open or email us!
@jubayer_hamid @allenainie Since the empirical mean already yielded strong results, we focused on our core contributions rather than hyper-parameter tuning for UCB. However, leveraging different selection diversities to instantiate diverse search algorithms remains an interesting future direction.
LLMs have struggled with search and optimization at scale when feedback is stochastic. We propose a simple solution, POLCA, which uses text embeddings with a “provable” guarantee. Excited to see the first theoretically correct work on LLM optimization. Kudos to @XuanfeiRen
Well, not for nothing -- we found a way to use Gemini embeddings to improve LLM-driven search algorithms. With a simple accept/reject rule in the embedding space, you get a provable guarantee on the search result.
🚀 We believe POLCA is a step toward making LLM-driven automated search more reliable, scalable, and principled. As LLMs are increasingly used to optimize prompts, agents, and code, stability under noise becomes essential—not optional. 10/