Deploying language models in scientific discovery domains requires extraordinary amounts of test-time compute for search algorithms. An ideal training algorithm should be designed with this goal in mind: agents should learn not only to exploit, but also to optimistically explore novel strategies. The agent should learn how to synergistically explore and exploit.
We propose Poly-EPO, a set RL algorithm that explores and discovers diverse reasoning paths. Work with @jubayer_hamid (co-lead), Shreya, @ShirleyYXWu, @HengyuanH, @noahdgoodman, @DorsaSadigh, and @chelseabfinn.
How can we autonomously improve LLM harnesses on problems humans are actively working on?
Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores.
Announcing Meta-Harness: a method for optimizing harnesses end-to-end
How can robot policies be trained to best leverage VLMs' CoT reasoning and in-context learning for generalization?
The key is Steerable Policies: vision-language-action models that can be flexibly controlled in many ways!
https://t.co/GvcvmY0JD5
1/9
Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks where strategic exploration is necessary. We introduce a framework for training a policy over sets of generations and use it to induce exploration.
Work with @ifdita_hasan (co-lead), @ellenjxu_ , @chelseabfinn and @DorsaSadigh at Stanford 🧵
Very happy to share that our work on learning long-history policies received the Best Paper Award from the Workshop on Learned Robot Representations @RoboticsSciSys ! 🤖🥳
Check out our paper if you haven't already! https://t.co/Wfbz58lF8D
Thank you to all the organizers and the amazing collaborators @tangerinecoder, @liu_yuejiang and @chelseabfinn!
Even the smartest LLMs can fail at basic multiturn communication
Ask for grocery help → without asking where you live 🤦‍♀️
Ask to write articles → assumes your preferences 🤷🏻‍♀️
⭐️CollabLLM (top 1%; oral @icmlconf) transforms LLMs from passive responders into active collaborators.
Website: https://t.co/Aq654MbyTL
Github: https://t.co/wlP8eByqSA
Blog: https://t.co/gBNJojNY5O
Paper: https://t.co/SfHH6ruqsS
🎯 Key insight: Rewards responses not by immediate helpfulness, but by their long-term impact on the conversation trajectory.
@MSFTResearch @StanfordAILab @stanfordnlp
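To make the key insight concrete, here is a minimal sketch of scoring a response by where the conversation ends up rather than by the response alone. It is an illustration under stated assumptions, not CollabLLM's implementation: `long_term_reward`, `toy_simulate_future`, and `toy_score` are hypothetical names, and the toy user simulator and scorer stand in for whatever simulator and metric a real system would use.

```python
from typing import Callable, List

Conversation = List[str]


def long_term_reward(
    history: Conversation,
    response: str,
    simulate_future: Callable[[Conversation, int], Conversation],
    score_conversation: Callable[[Conversation], float],
    num_rollouts: int = 3,
) -> float:
    """Average the score of simulated continuations that follow `response`.

    Hypothetical helper: the reward depends on the whole future
    trajectory, not on how helpful `response` looks in isolation.
    """
    if num_rollouts < 1:
        raise ValueError("num_rollouts must be at least 1")
    extended = history + [response]
    total = 0.0
    for seed in range(num_rollouts):
        future = simulate_future(extended, seed)  # assumed user simulator
        total += score_conversation(extended + future)
    return total / num_rollouts


# Toy stand-ins (assumptions for illustration only).
def toy_simulate_future(conversation: Conversation, seed: int) -> Conversation:
    # If the assistant asked a question, the simulated user answers it
    # and the task completes; otherwise the guess turns out to be wrong.
    if conversation[-1].endswith("?"):
        return ["user: I live in Springfield.",
                "assistant: Here are stores near you. TASK_DONE"]
    return ["user: That store is not in my city."]


def toy_score(conversation: Conversation) -> float:
    # 1.0 if the task was completed anywhere in the conversation.
    return 1.0 if any("TASK_DONE" in turn for turn in conversation) else 0.0


history = ["user: Help me order groceries."]
ask = long_term_reward(history, "assistant: Where do you live?",
                       toy_simulate_future, toy_score)
guess = long_term_reward(history, "assistant: Try the store on Main St.",
                         toy_simulate_future, toy_score)
```

In this toy setup the clarifying question earns the higher reward because the simulated conversation it leads to completes the task, even though the guess looks more immediately helpful.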
How can robots autonomously handle ambiguous situations that require commonsense reasoning?
*VLM-PC* provides adaptive high-level planning, so robots can get unstuck by exploring multiple strategies.
Paper: https://t.co/UmR6raIPiW
How do we make a scalable RL recipe for robots?
We study batch online RL w/ demos.
Key findings:
- iterative filtered imitation is insufficient
- need diverse policy data, e.g., using diffusion policy
- policy extraction can hinder data diversity
Paper: https://t.co/LsNtv4cRkU
🧠Memory is crucial for robots to handle occlusions, track progress, stay coherent, etc. Yet most VLAs truncate context.
🤔Why is long-context hard for robot policies? And how can we fix it?
📄Our new paper: Learning Long-Context Diffusion Policies via Past-Token Prediction
Was super fun exploring this! Most modern policies don't use history -- Diffusion Policy in particular gets a lot worse. We identify a simple ingredient for history improvement, and use it to improve efficiency and performance of long-context policies.
Giving history to our robot policies is crucial to solve a variety of daily tasks. However, diffusion policies get worse when adding history. 🤖
In our recent work, we show how adding an auxiliary loss that we call Past-Token Prediction (PTP), together with cached embeddings, enables us to reliably add longer history context to our robot policies! 🧠
We also show how PTP enables some test-time scaling techniques for robotics! 🚀
@sanjehorah i have an ad-hoc version in LaTeX annotated with date added/finished, stage (to read, annotate, done), priority, and stream (from Twitter, robotics, neuro, etc.) -- lmk if ppl get together to build as i have opinions
I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models.
We’ve designed a new architecture, which replaces the hidden state of an RNN with a machine learning model. This model compresses context through actual gradient descent on input tokens. We call our method “Test-Time-Training layers.”
TTT layers directly replace attention, and unlock linear complexity architectures with expressive memory, allowing us to train LLMs with millions (someday billions) of tokens in context.
Our instantiations, TTT-Linear and TTT-MLP, both match or beat the strongest Transformers and Mamba. Arxiv: https://t.co/3eEenKB17s
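A toy sketch of the idea that the hidden state is itself a model trained by gradient descent on the input tokens. This is a simplified illustration, not the TTT-Linear layer from the paper: it assumes a plain reconstruction loss 0.5 * ||W x - x||^2, one gradient step per token, and no learned projections or batching. The name `ttt_linear_scan` and the `lr` parameter are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def ttt_linear_scan(tokens: List[Vector], lr: float = 0.5) -> Tuple[List[Vector], Matrix]:
    """Process tokens left to right with a linear model W as the hidden state.

    For each token x:
      1. take one gradient step on 0.5 * ||W x - x||^2 (assumed loss),
      2. emit the output W x computed with the updated W.
    Cost is linear in sequence length because the state has fixed size.
    """
    d = len(tokens[0])
    W = [[0.0] * d for _ in range(d)]  # hidden state starts as the zero model
    outputs = []
    for x in tokens:
        err = [p - t for p, t in zip(matvec(W, x), x)]  # residual W x - x
        # Gradient of the loss with respect to W is the outer product err x^T.
        for i in range(d):
            for j in range(d):
                W[i][j] -= lr * err[i] * x[j]
        outputs.append(matvec(W, x))
    return outputs, W


# Feeding the same token repeatedly: the state "learns" to reconstruct it.
outputs, W = ttt_linear_scan([[1.0, 0.0]] * 3, lr=0.5)
```

With the repeated token above, the first output coordinate moves toward 1.0 step by step, which is the sense in which the state compresses context by training on it.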
We've had over a thousand new engineers try Quilter in the last few weeks, submitting some really interesting designs. We really want to see some of these come to life, so we're subsidizing board builds!
If you want to build a Quilter design in real life, we'll cover the cost of the PCB! More about this in the link.
Open hardware community: this one is especially for you ;)
Very excited to introduce ROAM, our new work that allows a robot to *adapt on-the-go* as it faces OOD situations during deployment, drawing on pre-trained behaviors.
Watch as ROAM enables our Go1 to roller skate zero-shot 🤖🐕🛼 (without any lessons!)
🧵(1/9)
We’ve had a flurry of product launches over the past week. Unless you’ve been on X every day, you likely missed a couple.
Here’s a recap of every launch so you can get up to speed👇