Reinforcement Learning for Joint Optimization of Multiple Rewards
Mridul Agarwal, Vaneet Aggarwal; 24(49):1−41, 2023.
Abstract
To find optimal policies that maximize the long-term rewards of Markov Decision Processes, dynamic programming and backward induction are typically used to solve the Bellman optimality equation. However, certain real-world problems require optimization of an objective that is non-linear in the cumulative rewards, making it impossible to apply dynamic programming directly. For instance, in a resource allocation problem, one of the objectives might be to maximize long-term fairness among users. We observe that when an agent aims to optimize a function of the sum of rewards, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long-term average of rewards. We propose model-based and model-free algorithms to learn the policy. The model-based algorithm is shown to achieve a regret of $\tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave $L$-Lipschitz function, where $S$ and $A$ are the numbers of states and actions, $D$ is the diameter of the MDP, and $T$ is the number of time steps. Furthermore, using fairness in cellular base-station scheduling and queueing-system scheduling as examples, we demonstrate that the proposed algorithm significantly outperforms conventional RL approaches.
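As a concrete illustration (a sketch in assumed notation, not taken verbatim from the paper), the objective can be written as maximizing a concave $L$-Lipschitz scalarization $f$ of the $K$-dimensional vector of long-term average rewards under policy $\pi$; proportional fairness is one example of such an $f$ and is non-linear in the cumulative rewards:
$$\max_{\pi}\; f\!\left(\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi}\big[\mathbf{r}_t\big]\right), \qquad \mathbf{r}_t\in\mathbb{R}^{K}, \qquad \text{e.g. } f(\bar{\mathbf{r}})=\sum_{k=1}^{K}\log \bar{r}_k \ \text{(proportional fairness)}.$$
Because $f$ couples the averages of all $K$ reward streams, the optimal action can depend on the rewards accumulated so far, which is why the problem is no longer Markov in the original state.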