q-Learning in Continuous Time
By Yanwei Jia and Xun Yu Zhou; 24(161):1−61, 2023.
Abstract
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we introduce its first-order approximation, termed the "(little) q-function." This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a q-learning theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving the underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, while another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with that of the PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
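To make the abstract's central object concrete, the display below sketches how the q-function can be read as a first-order approximation of the conventional Q-function over a small time step. The notation (value function J, discount rate β, temperature parameter γ, Hamiltonian H) is assumed here for illustration and is not quoted from this page; the precise definitions are given in the paper.

\[
Q_{\Delta t}(t,x,a) \;=\; J(t,x) + q(t,x,a)\,\Delta t + o(\Delta t),
\qquad
q(t,x,a) \;=\; \frac{\partial J}{\partial t}(t,x) + H\big(t,x,a,\partial_x J(t,x),\partial_{xx} J(t,x)\big) - \beta J(t,x),
\]

with the associated stochastic (Gibbs) policy sampling actions according to

\[
\boldsymbol{\pi}(a \mid t,x) \;\propto\; \exp\!\big(q(t,x,a)/\gamma\big).
\]

Under this reading, q plays the role of an instantaneous advantage rate, and q-learning amounts to learning J and q jointly from the martingale conditions, without any time discretization entering the theory itself.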