Toward general dexterity
We are training robot policies that learn a rich, coherent physical representation of the world by conditioning on multimodal observations.
Here’s a small glimpse of what we’ve been building:
Task: Pick the ramen cup and place it in the box
Given the current world state, the model generates a multimodal trajectory; here we show the decoded video and the corresponding actions executed on the humanoid.
🏆 Grand Prize Winners: Daydreamer
@diegocaples @_gundawar
They're tackling the "GPT Moment for Robotics."
Their agent uses a video diffusion model to imagine a successful outcome, executes it in the real world, and then uses VLM feedback to self-improve, training only on its successes.
🎉🥳 Congratulations to Yochanite Atharva Gundawar (@_gundawar), who successfully defended his MS thesis on "Scaling LLMs with LLM-Modulo" today! He is off to The AGI Company (really!).
On the use of Verifiers with LLMs--External vs. Internal LLM-Modulo (with @kayastechly, @karthikv792 and @21st_Warlock)
We are a bit tickled that verifiers seem to be all the rage on AI Twitter, as we have been advocating the use of external verifiers of various hues--hand-written, learned, synthesized, or even LLM-based--as part of LLM-Modulo (https://t.co/BglEooq5rF).
While LLM-Modulo focused on the use of verifiers in the inference stage, the current spike in interest in verifiers is of course because of their use in the RL post-training stage of DeepSeek R1.
In our own group, we are calling this use of verifiers in training (as a way to provide a reward signal to the trajectories generated by the base LLM) Internal LLM-Modulo (with the retronym External LLM-Modulo for our original use). The two uses can of course be combined fruitfully, as our work on LRM-Modulo shows (cf. https://t.co/RqRf4fWjrU).
There are some interesting tradeoffs at play in internal vs. external LLM-Modulo as discussed below:
[Verifier use Setup:] Simply put, verifiers in the external LLM-Modulo are used to overlay a generate-test cycle on top of the LLM generations: they check whether a generated solution is correct and, if not, push the LLM to generate other alternatives (with optional critiques provided by the verifier). https://t.co/mREKgH8mxk
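The generate-test cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the LLM-Modulo implementation: the `generate` and `verify` callables, their signatures, and the `max_rounds` budget are all hypothetical, and the toy generator and verifier stand in for a real LLM and a real external verifier.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the original thread):
# a generator proposes a candidate given the prompt and the critiques so far;
# a verifier returns (is_correct, critique), where critique may be empty.
Generator = Callable[[str, List[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


def generate_test_loop(
    prompt: str,
    generate: Generator,
    verify: Verifier,
    max_rounds: int = 5,
) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if the budget runs out."""
    critiques: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(prompt, critiques)
        ok, critique = verify(candidate)
        if ok:
            return candidate
        if critique:
            # Optional back-prompting: feed the critique into the next generation.
            critiques.append(critique)
    return None


# Toy stand-ins: the "LLM" proposes a larger number after every critique,
# and the "verifier" only accepts the string "2".
def toy_generate(prompt: str, critiques: List[str]) -> str:
    return str(len(critiques))


def toy_verify(candidate: str) -> Tuple[bool, str]:
    if candidate == "2":
        return True, ""
    return False, f"{candidate} is too small"
```

Because only verified candidates are ever returned, whatever guarantee the loop offers comes from the verifier rather than from the generator.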
The verifiers in the internal LLM-Modulo do the same job of checking whether the generated solution (not the intermediate tokens dubbed "reasoning patterns") is correct, but the result is used by RL to reward the traces of the base LLM that lead to verified correct solutions (cf. https://t.co/0htnB9tT4S).
(While our external LLM-Modulo setup does talk about incrementally collecting synthetic data (Step 7), one important difference is that we were suggesting solution data (to be used for fine-tuning) rather than the derivational trace data, which is what the RL post-training in R1 uses.)
[Correctness/Safety:] The RL post-training in internal LLM-Modulo doesn't necessarily guarantee the correctness of solutions at inference time; thus external LLM-Modulo still needs to be used in the inference stage to ensure the safety, correctness, etc. of the solutions that are output (cf. https://t.co/uk5ZKimWIM).
[Correctness of reasoning patterns vs. solutions:] Note that the internal LLM-Modulo is basically checking the correctness of the final solution rather than the intermediate tokens that have been dubbed reasoning patterns. Thus there are no real guarantees about the external significance of the reasoning patterns.
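To make the internal use concrete, here is a minimal sketch of an outcome-only reward: the verifier looks at the final solution and never at the intermediate trace. The `Sample` type, the `outcome_rewards` function, and the 0/1 reward values are assumptions made for illustration; this is not R1's actual training code, and the RL update that would consume these rewards is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    trace: str     # intermediate tokens; never inspected by the verifier
    solution: str  # final answer; the only part that is checked


def outcome_rewards(
    samples: List[Sample],
    verify_solution: Callable[[str], bool],
) -> List[float]:
    """Give reward 1.0 to samples whose final solution verifies, else 0.0."""
    return [1.0 if verify_solution(s.solution) else 0.0 for s in samples]


# Toy example: the verifier only knows that the right answer is "4".
samples = [
    Sample(trace="2 + 2 is 4", solution="4"),
    Sample(trace="bananas are yellow", solution="4"),  # nonsensical trace, correct answer
    Sample(trace="2 + 2 is 5", solution="5"),
]
rewards = outcome_rewards(samples, lambda sol: sol == "4")
```

In this toy run the second sample earns the same reward as the first even though its trace is meaningless, since the trace is simply not part of what gets verified.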
[Composability of Verifiers:] One of the arguments we make in favor of verifiers as opposed to solvers, in the external LLM-Modulo case, is that you can have a bank of verifiers, with each verifier (partially) guaranteeing the correctness of some aspect of the eventual solution. This idea can also potentially be of use in internal LLM-Modulo (i.e., the training phase), with the reward being composed out of the signals from the bank of verifiers.
[Verification vs. Critique:] In the external LLM-Modulo, we considered the verifier providing critiques to bias the base LLM's generation of the next candidate, and found it to be useful. In contrast, R1's training phase seems to just let the LLM generate multiple candidates in parallel and use verifiers only to provide reward signals. It would be interesting to see whether critiques could improve the sample complexity of training.
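One possible way to compose a reward from a bank of verifiers is a weighted fraction of the checks that pass. The sketch below assumes that scheme; the thread does not say how the signals would be combined, and the `composed_reward` function, the weights, and the toy verifiers are all hypothetical.

```python
from typing import Callable, Dict, Optional


def composed_reward(
    solution: str,
    verifiers: Dict[str, Callable[[str], bool]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted fraction of verifiers in the bank that accept the solution."""
    if not verifiers:
        raise ValueError("the verifier bank is empty")
    if weights is None:
        # Assumption: every verifier counts equally unless weights are given.
        weights = {name: 1.0 for name in verifiers}
    total = sum(weights[name] for name in verifiers)
    passed = sum(
        weights[name] for name, check in verifiers.items() if check(solution)
    )
    return passed / total


# Toy bank: each verifier checks a single aspect of the solution.
bank = {
    "non_empty": lambda s: len(s) > 0,
    "short": lambda s: len(s) <= 5,
}
```

With this bank, `composed_reward("hi", bank)` is 1.0 and `composed_reward("hello!", bank)` is 0.5. At inference time the same bank could instead be used as a hard gate, accepting a solution only when every verifier passes.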
@Scobleizer @alanmelling @Jandodev Another benefit is lower cost: these prompts are generally much smaller than if you were to write them out in a human-readable format. (From a conversation I had with a founding member.)
Using the generate-critic framework LLM-Modulo, we were able to increase the accuracy of GPT-4 Turbo on the travel planning benchmark by 5X. We are working on improving these results, stay tuned!
📢 Next Tuesday 7/23 afternoon, I will be presenting our #ICML2024 spotlight poster "Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks" (Hall C 4-9 #710).
@icmlconf poster page: https://t.co/F5zIdCte0Z