Towards Learning to Imitate from a Single Video Demonstration

Glen Berseth, Florian Golemo, Christopher Pal; 24(78):1−26, 2023.

Abstract

Learning to imitate behaviors observed in videos, without access to the internal state or action information of the observed agent, is crucial for agents operating in the natural world. However, developing a reinforcement learning (RL) agent that can achieve this goal is a major challenge. In this study, we address this challenge by using contrastive training to learn a reward function by comparing an agent’s behavior with a single demonstration. We employ a Siamese recurrent neural network architecture to learn a distance in both space and time between motion clips, using the negative distance as a reward while training an RL policy to minimize it. Through experimentation, we also find that incorporating multi-task data and an additional image encoding loss improves the temporal consistency of the learned rewards, which in turn significantly improves policy learning. We validate our approach on simulated humanoid, dog, and raptor agents in 2D, as well as quadruped and humanoid agents in 3D. Our results demonstrate that our method surpasses current state-of-the-art techniques and is capable of learning to imitate behaviors from a single video demonstration.
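To make the reward formulation concrete, below is a minimal sketch of the general idea described in the abstract: a Siamese (weight-shared) convolutional encoder followed by a recurrent head embeds video clips, and the negative embedding distance between the agent's rollout and the demonstration is used as the reward. This is an illustrative reconstruction, not the paper's implementation; the class names, layer sizes, and the choice of a GRU head are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SiameseClipEncoder(nn.Module):
    """Hypothetical clip encoder: a shared per-frame CNN plus a GRU that
    aggregates frame embeddings over time. Layer sizes are illustrative."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Per-frame image encoder; shared across both clips ("Siamese").
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim),
        )
        # Recurrent head captures the temporal structure of the clip.
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        frames = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, hidden = self.rnn(frames)
        return hidden[-1]  # (batch, embed_dim) clip embedding

def imitation_reward(encoder: SiameseClipEncoder,
                     agent_clip: torch.Tensor,
                     demo_clip: torch.Tensor) -> torch.Tensor:
    """Reward = negative distance between the agent's clip embedding and
    the demonstration's, so matching the demonstration maximizes reward."""
    with torch.no_grad():
        z_agent = encoder(agent_clip)
        z_demo = encoder(demo_clip)
    return -torch.norm(z_agent - z_demo, dim=-1)
```

In this framing, the encoder is trained contrastively (to place matching clips close together and non-matching clips far apart), and the RL policy is trained to maximize the resulting reward, i.e., to minimize the learned distance to the single demonstration.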

