Synthetic data will provide the next trillion tokens to fuel our hungry models.
I'm excited to announce MimicGen: massively scaling up the data pipeline for robot learning! We multiply high-quality human data in simulation with digital twins.
Using < 200 human demonstrations, MimicGen can autonomously generate > 50,000 training episodes across 18 tasks, multiple simulators, and even in the real world!
The idea is simple:
1. Humans tele-operate the robot to complete a task. This data is extremely high-quality but also very slow and expensive to collect.
2. We create a digital twin of the robot and the scene in high-fidelity, GPU-accelerated simulation.
3. We can now move objects around, replace them with new assets, and even change the robot hand - basically augment the training data with procedural generation.
4. Export the successful episodes, and feed them to a neural network! You now have a near-infinite stream of data.
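To make the four steps concrete, here is a toy sketch in Python. It is not MimicGen's code: the `Demo` class, the 2D waypoints, and the `rollout_succeeds` check are all hypothetical stand-ins for a real teleoperation log and a real simulator. It only illustrates the loop of reusing a human demo under a new object placement and keeping the episodes that succeed.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Demo:
    """Hypothetical record of one tele-operated demonstration (step 1)."""
    object_xy: tuple   # where the object sat during the human demo
    waypoints: list    # end-effector path as (x, y) points


def retarget(demo, new_object_xy):
    """Shift the recorded path so it keeps the same offset to the moved object (step 3)."""
    dx = new_object_xy[0] - demo.object_xy[0]
    dy = new_object_xy[1] - demo.object_xy[1]
    return [(x + dx, y + dy) for x, y in demo.waypoints]


def rollout_succeeds(waypoints, object_xy, workspace=1.0, tol=0.01):
    """Toy stand-in for a simulator rollout (step 2).

    Assumption: success means the path stays inside a square workspace
    and ends within `tol` of the object.
    """
    in_bounds = all(abs(x) <= workspace and abs(y) <= workspace for x, y in waypoints)
    end_x, end_y = waypoints[-1]
    reached = math.hypot(end_x - object_xy[0], end_y - object_xy[1]) <= tol
    return in_bounds and reached


def generate(demos, n_attempts, seed=0):
    """Randomize object placement, replay a retargeted demo, keep successes (step 4)."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n_attempts):
        demo = rng.choice(demos)
        new_xy = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        path = retarget(demo, new_xy)
        if rollout_succeeds(path, new_xy):
            episodes.append({"object_xy": new_xy, "waypoints": path})
    return episodes


human_demo = Demo(object_xy=(0.5, 0.0), waypoints=[(0.0, 0.0), (0.25, 0.0), (0.5, 0.0)])
dataset = generate([human_demo], n_attempts=1000)
print(f"kept {len(dataset)} of 1000 attempted episodes")
```

Some attempts get filtered out because the shifted path leaves the workspace. That filtering step is the point: only successful rollouts become training data.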
One of the key reasons that robotics lags far behind other AI fields is the lack of data: you cannot scrape control signals from the internet. They simply don't exist in the wild.
MimicGen shows the power of synthetic data and simulation to keep our scaling laws alive. I believe this principle applies beyond robotics. We are quickly exhausting the high-quality, real tokens from the web. Artificial intelligence from artificial data will be the way forward.
We are big fans of the OSS community. As usual, we open-source everything, including the generated dataset!
- Website: https://t.co/4pEZ2igP2u
- Paper: https://t.co/O7qi3FTBIs
- Dataset is hosted on HuggingFace (thanks @_akhaliq!!): https://t.co/E9ryjWNzBE
- Code: https://t.co/7Blv1Z5F09
MimicGen is led by @AjayMandlekar. Deep dive in the thread:
Introducing DoctorGPT! After applying fine-tuning, reinforcement learning, & compilation techniques to Meta's Llama2 model, I got amazing results:
- Passes the US Medical Licensing Exam
- Offline
- iOS & Android
- Open Source
Code:
https://t.co/BUJyW5rx7N
Full video tutorial: