In this paper, we present a novel approach to visual model-based RL. Existing methods typically encode image observations into low-dimensional representations, but they do not effectively eliminate redundant information. As a result, they are vulnerable to spurious variations, such as changes in background distractors or lighting conditions, that are irrelevant to the task at hand.
To address this issue, we propose a visual model-based RL method that learns a latent representation resilient to such spurious variations. Our training objective maximizes the representation's predictive power for dynamics and reward while constraining the amount of information flowing from the observation into the latent representation. This significantly enhances the resilience of visual model-based RL to visual distractors, allowing it to operate effectively in dynamic environments.
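Schematically, this objective can be viewed as an information bottleneck. The following is a sketch in illustrative notation rather than the exact form used in the paper, with $z_t$ the latent state, $o_t$ the observation, $a_t$ the action, and $r_t$ the reward:

$$\max_{\theta}\;\; I\big(z_t;\, z_{t+1:T},\, r_{t:T} \,\big|\, a_{t:T}\big) \quad \text{subject to} \quad I\big(o_t;\, z_t\big) \le C,$$

or, as a Lagrangian relaxation, $\max_{\theta}\, I(z_t;\, z_{t+1:T}, r_{t:T} \mid a_{t:T}) - \beta\, I(o_t;\, z_t)$. The budget $C$ (equivalently, the multiplier $\beta$) controls how aggressively task-irrelevant information is discarded from the latent.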
Furthermore, we demonstrate that although the learned encoder is resilient to spurious variations, it is not invariant under significant distribution shifts. To overcome this limitation, we introduce a simple reward-free alignment procedure that adapts the encoder at test time, enabling fast transfer to widely differing environments without relearning the dynamics or policy.
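As a concrete illustration, below is a minimal sketch of such a test-time adaptation loop, assuming PyTorch-style modules. The names (`align_encoder`, `encoder`, `dynamics`, `target_obs_loader`) are hypothetical, and the specific reward-free loss shown (latent dynamics consistency on unlabeled target-domain transitions) is one illustrative choice, not necessarily the exact alignment objective used in the paper:

```python
import torch
import torch.nn.functional as F
from itertools import islice

def align_encoder(encoder, dynamics, target_obs_loader, steps=1000, lr=1e-4):
    """Sketch of reward-free test-time alignment (hypothetical interface).

    Only the encoder is updated; the latent dynamics model (and, implicitly,
    the policy operating on latents) stays frozen, so behaviors learned at
    training time are reused unchanged once the latents are realigned.
    """
    for p in dynamics.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)

    # target_obs_loader yields unlabeled (obs, action, next_obs) transitions
    # collected in the shifted test environment; no rewards are needed.
    for obs, action, next_obs in islice(target_obs_loader, steps):
        z = encoder(obs)              # latent for the current observation
        z_next = encoder(next_obs)    # latent for the next observation
        z_pred = dynamics(z, action)  # frozen model's one-step prediction

        # Reward-free consistency loss: push encoded target-domain latents
        # to agree with the frozen dynamics, i.e. back onto the latent
        # distribution the dynamics model and policy were trained on.
        loss = F.mse_loss(z_next, z_pred)

        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder
```

Because only the encoder's parameters change, adaptation is cheap relative to retraining: the dynamics model and policy carry over as-is, which is what makes quick adjustment to new environments possible.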
Our work represents a significant step towards making model-based RL a practical and valuable tool for dynamic and diverse domains. We validate the effectiveness of our approach through simulation benchmarks that include significant spurious variations, as well as a real-world egocentric navigation task with noisy TVs in the background.
For more information, including videos and code, please visit our website: https://zchuning.github.io/repo-website/.