Off-Policy Actor-Critic with Emphatic Weightings
Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White; 24(146):1−63, 2023.
Abstract
A variety of policy gradient algorithms have been developed for the on-policy setting based on the policy gradient theorem, which simplifies the gradient computation. The off-policy setting, however, is more challenging because of competing objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we propose a unified off-policy objective and derive a policy gradient theorem for this objective; the derivation incorporates emphatic weightings and interest functions. We introduce the Actor Critic with Emphatic weightings (ACE) algorithm, which approximates this gradient using different strategies for estimating the emphatic weightings. On a counterexample, we show that previous off-policy actor-critic methods, such as Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG), converge to the wrong solution, whereas ACE finds the optimal solution. We also explain why these semi-gradient approaches can still perform well in practice and propose variance reduction strategies for ACE. Empirical experiments on two classic control environments and an image-based environment illustrate the tradeoffs of each gradient approximation. Across all tested settings, ACE, which directly approximates the emphatic weightings, performs as well as or better than OffPAC.
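To make the idea behind the abstract concrete, the sketch below shows one possible emphatically weighted off-policy actor-critic update in a tabular setting. It is an illustrative reconstruction, not the paper's algorithm or code: the environment transitions are stand-ins, the behavior policy, step sizes, and interest function are assumptions, and the followon trace is used directly as the emphasis (no interpolation with the interest).

```python
import numpy as np

# Minimal sketch: emphatically weighted off-policy actor-critic (illustrative only).
# Assumptions: tabular states, softmax target policy, fixed uniform behavior policy.
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

theta = np.zeros((n_states, n_actions))   # actor parameters (softmax preferences)
v = np.zeros(n_states)                    # critic: state-value estimates
gamma, alpha_actor, alpha_critic = 0.99, 0.1, 0.05
interest = np.ones(n_states)              # interest function i(s); uniform here

def target_policy(s):
    prefs = theta[s] - theta[s].max()
    e = np.exp(prefs)
    return e / e.sum()

behavior = np.full(n_actions, 1.0 / n_actions)

F, rho_prev = 0.0, 1.0                    # followon (emphatic) trace and previous ratio
for _ in range(1000):
    s = rng.integers(n_states)            # stand-in transition; a real environment goes here
    a = rng.integers(n_actions)           # action drawn from the uniform behavior policy
    s_next = rng.integers(n_states)
    r = float(s_next == n_states - 1)     # stand-in reward

    pi = target_policy(s)
    rho = pi[a] / behavior[a]             # importance sampling ratio
    F = gamma * rho_prev * F + interest[s]    # followon trace accumulates emphasis
    delta = r + gamma * v[s_next] - v[s]      # TD error from the critic

    v[s] += alpha_critic * rho * delta        # off-policy critic update
    grad_logpi = -pi
    grad_logpi[a] += 1.0                      # gradient of log softmax pi(a|s) w.r.t. theta[s]
    # Actor step weighted by the emphatic trace instead of only the sampled state:
    theta[s] += alpha_actor * F * rho * delta * grad_logpi

    rho_prev = rho
```

The key difference from a semi-gradient update such as OffPAC's is the factor `F`: the actor step is scaled by an accumulated emphasis over the states that lead to the current one, rather than weighting updates only by the behavior policy's state visitation.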