Provably Sample-Efficient Model-Free Algorithm for MDPs with Peak Constraints

Qinbo Bai, Vaneet Aggarwal, Ather Gattami; 24(60):1–25, 2023.

Abstract

In the optimization of dynamic systems, the decision variables are often subject to constraints. Such problems can be modeled as Constrained Markov Decision Processes (CMDPs). This paper considers the peak Constrained Markov Decision Process (PCMDP), in which the agent chooses a policy that maximizes the total reward over a finite horizon while satisfying the constraints at each epoch with probability one. We propose a model-free, Q-learning-based algorithm that converts the PCMDP problem into an unconstrained problem. We introduce the notion of a probably approximately correct (PAC) policy for the PCMDP problem, and prove that the proposed algorithm achieves an $(\epsilon,p)$-PAC policy when the number of episodes satisfies $K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$, where $S$ and $A$ are the numbers of states and actions, respectively, $H$ is the number of epochs per episode, $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$ with $T=KH$ the total number of steps. To the best of our knowledge, this is the first PAC-type analysis for PCMDPs with peak constraints when the transition dynamics are not known in advance. We demonstrate the effectiveness of the proposed algorithm on an energy harvesting problem and a single-machine scheduling problem, where it performs close to the theoretical upper bound of the optimization problem under study.
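The abstract describes the algorithm only at a high level: a Q-learning scheme on a reshaped, unconstrained problem. As one hedged illustration of that idea, the sketch below folds the peak constraints into the reward by assigning a penalty to any state-action pair that violates a constraint, and runs optimistic episodic Q-learning with a UCB-Hoeffding-style bonus. The function names (`reward`, `constraints`, `step`), the penalty value, and the specific learning rate and bonus are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def pcmdp_q_learning(S, A, H, K, reward, constraints, step, p=0.05, penalty=None):
    """Hypothetical sketch: episodic Q-learning on a PCMDP whose peak
    constraints are folded into the reward as a penalty.

    reward(h, s, a)      -> scalar reward in [0, 1]
    constraints(h, s, a) -> True iff every peak constraint holds at (s, a)
    step(h, s, a)        -> next state sampled from the unknown dynamics
    """
    if penalty is None:
        penalty = -H  # assumption: any value below the minimum attainable return
    ell = np.log(S * A * K * H / p)          # log factor, mirroring l = log(SAT/p)
    Q = np.full((H, S, A), float(H))         # optimistic initialization
    N = np.zeros((H, S, A), dtype=int)       # visit counts

    for _ in range(K):
        s = 0  # fixed initial state, for simplicity
        for h in range(H):
            a = int(np.argmax(Q[h, s]))      # greedy w.r.t. optimistic Q
            s_next = step(h, s, a)
            # Fold the peak constraint into the reward: violating (s, a)
            # pairs are penalized, so the learned policy avoids them.
            r = reward(h, s, a) if constraints(h, s, a) else penalty
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                  # learning rate
            bonus = np.sqrt(H ** 3 * ell / t)          # exploration bonus
            V_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)
            s = s_next
    return Q  # greedy policy: pi_h(s) = argmax_a Q[h, s, a]
```

Under these assumptions, the greedy policy with respect to the returned Q-table steers away from constraint-violating actions, since their penalized values fall below those of any feasible action once they have been visited often enough.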
