Excited that our AI Control Roadmap is out. We built an internal monitoring system that reviews agent trajectories and classifies them against our threat taxonomy. We've now analyzed 1M+ coding agent tasks. Always happy to connect with others working on AI control and oversight!
Instead of assuming AI will always do what we intend, we ask: what if it doesn't?
That’s why we’ve developed our AI Control Roadmap: a framework for building and managing the advanced AI we deploy within Google. 🧵
Key takeaway: most flagged events aren't adversarial, but from agents misinterpretation or being overeager to hit a user's goal. That insight fed directly into the live monitors we built for Gemini Spark agents that catch things like unintentional data deletion in real time.
Excited that our AI Control Roadmap is out. We built an internal monitoring system that reviews agent trajectories and classifies them against our threat taxonomy. We've now analyzed 1M+ coding agent tasks. Always happy to connect with others working on AI control and oversight!
Instead of assuming AI will always do what we intend, we ask: what if it doesn't?
That’s why we’ve developed our AI Control Roadmap: a framework for building and managing the advanced AI we deploy within Google. 🧵
Long road to Gemma 4. Learned a ton from the people around me along the way. Huge congrats to the whole team, and can't wait to see what you build with it!
Excited to launch Gemma 4: the best open models in the world for their respective sizes. Available in 4 sizes that can be fine-tuned for your specific task: 31B dense for great raw performance, 26B MoE for low latency, and effective 2B & 4B for edge device use - happy building!
Ran into @lexfridman by chance earlier this week and finally sat down today. Perfect way to wrap up #NeurIPS. It’s been 6 years (how time flies!) since we worked together at MIT. Great catching up on life, ML, and old times.
Thrilled to announce I’ve successfully defended my PhD! 🎓 Deeply grateful to my advisor Lee Spector, my committee @scottniekum, @MajiSubhransu, @jeffclune, and all collaborators, friends, and family. Milestone achieved, excited for the next chapter!
Preferences in RLHF often come from many people with differing values. Ryan's work explores how to infer a set of representative reward functions that captures that diversity, so that we can better reason about risk and fairness in these settings.
POPL enhances the safety and fairness of RLHF by aligning agents and LLMs with diverse human values. It effectively addresses hidden contexts in preferences, ensuring risk-sensitive alignment without additional labeling.
Led by @RyanBoldi, w/ Lee Spector and @scottniekum.
Excited to share our new paper on Pareto Optimal Preference Learning (POPL)! 🎉
POPL aims to better align AI with diverse human values by building diverse sets of reward functions or policies!
https://t.co/rHXbjZANBa
Work done with @li_ding_ , Lee Spector and @scottniekum
Special thanks to @andrewdai99 for insightful suggestions on the paper! As we prepare to share the latest version, check out our previous spotlight talk at NeurIPS @aloeworkshop: https://t.co/Ml4ShnBvo1, and the recent tutorial in @pyribs (w/ @btjanaka): https://t.co/bB98b16otP.
In a nutshell, new results highlight QDHF's strength in handling complex prompts for GenAI. While diffusion models often struggle with composing objects and attributes correctly, QDHF enhances diversity to explore various compositions, thus improving the quality of responses.
Exciting news! Our QDHF paper has been accepted at #ICML2024! Huge thanks to incredible coauthors: @jennyzhangzt, @jeffclune, Lee Spector, and @joelbot3000. Stay tuned for updates on arXiv with our latest findings!
Project website: https://t.co/p4yRANZzD7 👇
🚀Thrilled to release the QDHF tutorial in @pyribs! Big shoutout to @btjanaka for his meticulous editing and insightful feedback👏. Dive into the tutorial to explore how QDHF enhances GenAI models with diversified, high-quality responses and apply these insights to your projects!
Ecstatic to announce the release of @pyribs 0.7.1! The absolute highlight of this release is the QDHF (Quality Diversity through Human Feedback) tutorial contributed by @li_ding_! The tutorial is available here and runs on Google Colab in ~1 hour:
https://t.co/QBJDkKzJQs
Had a great time yesterday presenting our QDHF work at #NeurIPS2023@aloeworkshop! Fantastic workshop with great vibes. Glad to meet so many people in open-endedness, QD, RL, especially my amazing collaborators @jennyzhangzt and @jeffclune! (Thanks to @RyanBoldi for this photo!)
Check out the QDHF project website: https://t.co/FydVelYLei. We also released our code on GitHub: https://t.co/rNqkVm0oLt. Stay tuned for more updates in the upcoming weeks!
Hello #NeurIPS2023! I will present our work on Quality Diversity through Human Feedback tomorrow 12/15 at @aloeworkshop. Feel free to stop by Room 211-213 for our spotlight talk (4:15-4:30) or catch us during the poster session (12:45-1:45). Let's chat about learning and more!
Thrilled to announce the first annual Reinforcement Learning Conference @RL_Conference, which will be held at UMass Amherst August 9-12!
RLC is the first strongly peer-reviewed RL venue with proceedings, and our call for papers is now available: https://t.co/QiZcJYM1MR.
Thrilled to announce the first annual Reinforcement Learning Conference @RL_Conference, which will be held at UMass Amherst August 9-12! RLC is the first strongly peer-reviewed RL venue with proceedings, and our call for papers is now available: https://t.co/2yf5nruvpl. 🧵