We study the learning of $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players sequentially update their policies from the observations they gather over a fixed number of episodes, denoted by $T$. Existing methods, however, suffer from high variance because they rely on importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To address this issue, we propose a fixed-sampling approach in which players still update their policies over time, but observations are obtained through a predetermined, fixed sampling policy.
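To illustrate why a fixed sampling policy helps, consider a standard importance-weighted loss estimate built from a predetermined sampling policy $\pi^{\mathrm{fix}}$ (the notation here is ours, used only for illustration):
\[
\hat{\ell}^{\,t}(x,a) \;=\; \frac{\mathbb{1}\{(x,a)\ \text{visited at episode } t\}}{\pi^{\mathrm{fix}}(x,a)}\,\ell^{\,t}(x,a),
\]
where $\pi^{\mathrm{fix}}(x,a)$ is the probability that the sampling policy reaches information set $x$ and plays action $a$. Because the denominator is fixed in advance, its range is controlled by the sampling policy alone rather than by the learner's current, possibly vanishing, probabilities over whole sequences of actions.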
Our approach builds on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, with individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability and, for the best theoretical choices of learning rates and sampling policies, achieves a near-optimal dependence on the game parameters. To obtain these results, we extend the notion of OMD stabilization to allow time-varying regularization with convex increments.
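To make the local update concrete, the sketch below shows one OMD step at a single information set with an individually decreasing learning rate. It assumes a negative-entropy regularizer, under which the step takes the familiar multiplicative-weights form; the regularized loss and the paper's exact learning-rate schedule are omitted, so this is an illustrative sketch rather than the authors' algorithm.

```python
import numpy as np

def local_omd_update(policy, loss_estimate, visit_count, eta0=1.0):
    """One local OMD step at a single information set.

    Minimal sketch assuming a negative-entropy regularizer, under which the
    OMD step reduces to a multiplicative-weights update. The schedule
    eta0 / sqrt(visits) is an illustrative assumption, not the paper's choice.
    """
    eta = eta0 / np.sqrt(max(visit_count, 1))         # individually decreasing rate
    weights = policy * np.exp(-eta * loss_estimate)   # exponentiated-gradient step
    return weights / weights.sum()                    # normalize back onto the simplex

# Example: update a uniform policy over 3 actions from one loss estimate.
pi = np.ones(3) / 3
loss_hat = np.array([0.2, 0.0, 1.5])   # e.g. an importance-weighted estimate
pi = local_omd_update(pi, loss_hat, visit_count=10)
```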