MT-Opt: Continuous Multi-Task Robotic
Reinforcement Learning at Scale
arXiv 2021

Abstract

General-purpose robotic systems must master a large repertoire of diverse skills to be useful in a range of daily tasks. While reinforcement learning provides a powerful framework for acquiring individual behaviors, the time needed to acquire each skill makes the prospect of a generalist robot trained with RL daunting. In this paper, we study how a large-scale collective robotic learning system can acquire a repertoire of behaviors simultaneously, sharing exploration, experience, and representations across tasks. In this framework, new tasks can be continuously instantiated from previously learned tasks, improving the overall performance and capabilities of the system. To instantiate this system, we develop a scalable and intuitive framework for specifying new tasks through user-provided examples of desired outcomes, devise a multi-robot collective learning system for data collection that simultaneously collects experience for multiple tasks, and develop a scalable and generalizable multi-task deep reinforcement learning method, which we call MT-Opt. We demonstrate how MT-Opt can learn a wide range of skills, including semantic picking (i.e., picking an object from a particular category), placing into various fixtures (e.g., placing a food item onto a plate), covering, aligning, and rearranging. We train and evaluate our system on a set of 12 real-world tasks with data collected from 7 robots, and demonstrate the performance of our system both in terms of its ability to generalize to structurally similar new tasks and to acquire distinct new tasks more quickly by leveraging past experience.

Approach

To collect diverse, multi-task data at scale, we create an intuitive success-detector-based approach that allows us to quickly define new tasks and their rewards. We train a multi-task success detector using data from all the tasks and continuously update it to accommodate distribution shifts caused by real-world factors such as varying lighting conditions and changing backgrounds. In addition, we devise a data collection strategy that simultaneously collects data for multiple distinct tasks across multiple robots, using solutions to easier tasks to bootstrap learning of more complex tasks. Over time, this allows us to start training a policy for the harder tasks and, consequently, to collect better data for those tasks.
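To make the reward-specification idea concrete, below is a minimal sketch of a task-conditioned success detector: a small image classifier that scores the final frame of an episode as success or failure for a given task, trained from user-provided outcome examples. The architecture, input sizes, and framework choice (PyTorch, 64x64 frames, a one-hot task id) are illustrative assumptions, not the exact model used by MT-Opt.

```python
# Minimal sketch of a task-conditioned success detector (assumed shapes and
# architecture; the actual MT-Opt success detector differs).
import torch
import torch.nn as nn

NUM_TASKS = 12  # matches the 12 real-world tasks in the paper

class SuccessDetector(nn.Module):
    """Binary success/failure classifier for a final episode frame, conditioned on the task."""
    def __init__(self, num_tasks: int = NUM_TASKS):
        super().__init__()
        self.encoder = nn.Sequential(            # small conv encoder for the image
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(               # fuse image features with the task one-hot
            nn.Linear(64 + num_tasks, 128), nn.ReLU(),
            nn.Linear(128, 1),                   # logit for P(success | image, task)
        )

    def forward(self, image: torch.Tensor, task_onehot: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(image)
        return self.head(torch.cat([feats, task_onehot], dim=-1)).squeeze(-1)

# One training step on labeled outcome examples (success = 1, failure = 0).
detector = SuccessDetector()
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

images = torch.randn(8, 3, 64, 64)                                        # placeholder final frames
tasks = nn.functional.one_hot(torch.randint(0, NUM_TASKS, (8,)), NUM_TASKS).float()
labels = torch.randint(0, 2, (8,)).float()                                 # placeholder success labels

loss = loss_fn(detector(images, tasks), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the detector is shared across tasks and retrained as new data arrives, it can absorb the lighting and background shifts mentioned above without a per-task labeling effort.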

The robots generate episodes, which are then labeled as success or failure for the current task. These episodes are then copied and shared across other tasks to increase learning efficiency. A balanced batch of episodes is then sent to our multi-task RL training pipeline to train the MT-Opt policy, as sketched below.
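The following sketch illustrates the two data-pipeline ideas from the paragraph above, episode sharing across tasks and balanced per-task batching, using plain Python containers and a dummy success detector. The task names, buffer layout, and sampling details are hypothetical and much simpler than the real MT-Opt pipeline.

```python
# Sketch of episode sharing and balanced batching (simplified assumptions).
import random
from collections import defaultdict

def label_and_share(episode, success_detector, all_tasks):
    """Label an episode for every task via the shared success detector, so one
    episode can serve as a (possibly negative) example for several tasks."""
    shared = []
    for task in all_tasks:
        reward = float(success_detector(episode["final_image"], task))  # 1.0 or 0.0
        shared.append(dict(episode, task=task, reward=reward))
    return shared

def balanced_batch(buffers, batch_size):
    """Sample roughly the same number of episodes per task, so rare tasks are
    not drowned out by data-rich tasks during training."""
    tasks = [t for t, buf in buffers.items() if buf]
    per_task = max(1, batch_size // len(tasks))
    batch = []
    for t in tasks:
        batch.extend(random.choices(buffers[t], k=per_task))
    return batch[:batch_size]

# Example usage with placeholder episodes and a dummy detector.
buffers = defaultdict(list)
dummy_detector = lambda image, task: random.random() > 0.5
for i in range(100):
    episode = {"final_image": None, "id": i}
    for ep in label_and_share(episode, dummy_detector, ["pick-any", "place-plate"]):
        buffers[ep["task"]].append(ep)

train_batch = balanced_batch(buffers, batch_size=32)
```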

Results

We train MT-Opt on a dataset of 9600 robot hours collected with 7 robots. We use offline multi-task reinforcement learning and learn a wide variety of skills, including picking specific objects, placing them into various fixtures, aligning items on a rack, rearranging objects, and covering objects with towels.
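MT-Opt builds on QT-Opt-style Q-learning, in which the continuous-action maximization inside the Bellman target is approximated with the cross-entropy method (CEM). The snippet below sketches that target computation for a task-conditioned Q-function; the network, CEM hyperparameters, and discount factor are illustrative assumptions, not the published configuration.

```python
# Sketch of a QT-Opt-style TD target with CEM action maximization (assumed
# shapes and hyperparameters; for illustration only).
import torch

def cem_maximize(q_fn, obs, task, action_dim, iters=3, samples=64, elites=6):
    """Approximate argmax_a Q(obs, task, a) with a simple Gaussian CEM loop."""
    mean = torch.zeros(obs.shape[0], action_dim)
    std = torch.ones(obs.shape[0], action_dim)
    for _ in range(iters):
        # Sample candidate actions around the current mean and score them.
        actions = mean.unsqueeze(1) + std.unsqueeze(1) * torch.randn(obs.shape[0], samples, action_dim)
        q = q_fn(obs.unsqueeze(1).expand(-1, samples, -1),
                 task.unsqueeze(1).expand(-1, samples, -1), actions)
        top = q.topk(elites, dim=1).indices
        elite_actions = torch.gather(actions, 1, top.unsqueeze(-1).expand(-1, -1, action_dim))
        mean, std = elite_actions.mean(dim=1), elite_actions.std(dim=1) + 1e-4
    return mean  # best-action estimate per batch element

def td_target(q_target_fn, reward, done, next_obs, task, action_dim, gamma=0.9):
    """Bellman target r + gamma * max_a Q_target(s', task, a), with the max via CEM."""
    with torch.no_grad():
        best_next = cem_maximize(q_target_fn, next_obs, task, action_dim)
        next_q = q_target_fn(next_obs, task, best_next)
    return reward + gamma * (1.0 - done) * next_q

# Dummy task-conditioned Q-network operating on the last dimension.
obs_dim, task_dim, action_dim = 16, 12, 4
mlp = torch.nn.Sequential(torch.nn.Linear(obs_dim + task_dim + action_dim, 64),
                          torch.nn.ReLU(), torch.nn.Linear(64, 1))
q_fn = lambda o, t, a: mlp(torch.cat([o, t, a], dim=-1)).squeeze(-1)

batch = 8
target = td_target(q_fn, reward=torch.rand(batch), done=torch.zeros(batch),
                   next_obs=torch.randn(batch, obs_dim),
                   task=torch.rand(batch, task_dim), action_dim=action_dim)
```

Because training is fully offline, the targets are computed from the shared, balanced episode batches described above rather than from fresh on-policy rollouts.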



When compared to single-task baselines, our MT-Opt system performs similarly on the tasks with the most data (e.g., 89% success on the generic lifting task) while significantly improving performance on tasks underrepresented in the dataset: a 50% average success rate on rare tasks, compared to 1% with a single-task QT-Opt baseline and 18% with a naive multi-task QT-Opt baseline.


Using this large pre-trained model, we can not only generalize zero-shot to new but structurally similar tasks, but also quickly (in roughly one day of data collection on 7 robots) fine-tune the system to new, previously unseen tasks, such as the towel-covering task shown below, which was not present in our original dataset (resulting in a 92% success rate for towel picking and a 79% success rate for object covering).

Citation

Acknowledgements

The authors would like to thank Josh Weaver, Noah Brown, Khem Holden, Linda Luu and Brandon Kinman for their robot operation support. We also thank Yao Lu and Anthony Brohan for their help with distributed learning and testing infrastructure; Tom Small for help with videos and project media; Tuna Toksoz and Garrett Peake for improving the bin reset mechanisms; Julian Ibarz, Kanishka Rao, Vikas Sindhwani and Vincent Vanhoucke for their support; Satoshi Kataoka, Michael Ahn, and Ken Oslund for help with the underlying control stack; and the rest of the Robotics at Google team for their overall support and encouragement. All of these contributions were incredibly enabling for this project.

The website template was borrowed from Jon Barron.