In this work, we introduce a variant of the best-arm identification problem, which we call best-arm identification under mediators’ feedback (BAI-MF). In traditional BAI problems, the learner directly chooses which arm to pull, and the goal is to find the arm with the highest expected reward. However, this framework does not accurately represent decision-making problems in which the learner lacks full control over the arms, such as off-policy learning and partially controllable environments.

In BAI-MF, the learner has access to a set of mediators, each of which selects arms on the learner’s behalf according to its own stochastic policy. After each selection, the mediator reports the chosen arm and the observed reward back to the learner. The learner’s objective is to sequentially select mediators so as to identify the optimal arm with high probability while minimizing the sample complexity.
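The interaction protocol described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the strategy proposed in this work: the function names `query_mediator` and `run_round_robin`, the unit-variance Gaussian rewards, the example policies, and the round-robin choice of mediators are all hypothetical and chosen purely for illustration.

```python
import random


def query_mediator(policy, arm_means, rng):
    """One round of mediator feedback (illustrative).

    The mediator samples an arm from its own stochastic policy, pulls it,
    and reports the pair (arm, reward) back to the learner.
    """
    arms = list(range(len(policy)))
    arm = rng.choices(arms, weights=policy, k=1)[0]
    # Assumption: rewards are Gaussian with unit variance.
    reward = rng.gauss(arm_means[arm], 1.0)
    return arm, reward


def run_round_robin(policies, arm_means, rounds, seed=0):
    """Query mediators in round-robin order (a naive baseline, not the
    paper's algorithm) and return the empirically best arm and pull counts."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(rounds):
        mediator = t % len(policies)  # the learner chooses a mediator, not an arm
        arm, reward = query_mediator(policies[mediator], arm_means, rng)
        counts[arm] += 1
        sums[arm] += reward
    # Arms that were never observed get -inf so they cannot be recommended.
    means = [s / c if c else float("-inf") for s, c in zip(sums, counts)]
    best = max(range(n_arms), key=lambda a: means[a])
    return best, counts


# Hypothetical instance: two mediators, three arms.
policies = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
arm_means = [0.0, 0.5, 2.0]
best_arm, pull_counts = run_round_robin(policies, arm_means, rounds=2000)
```

Note that the learner only influences how often each arm is observed indirectly, through its choice of mediator; this is the feature that distinguishes the setting from traditional BAI.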

To address this problem, we first establish a statistical lower bound on the sample complexity that is specific to the mediator feedback scenario. We then propose a sequential decision-making strategy for the case in which the learner knows the mediators’ policies. The sample complexity of this algorithm matches the lower bound both almost surely and in expectation.

Furthermore, we extend our results to the case in which the mediators’ policies are unknown to the learner, and show that comparable results can still be obtained.
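When a mediator's policy is unknown, one natural ingredient is to estimate it from the arms that the mediator has reported so far. The sketch below shows a plain empirical-frequency estimate; it is an illustrative assumption rather than a description of the method used in this work, and the name `estimate_policy`, the uniform fallback for unqueried mediators, and the example policy are hypothetical.

```python
import random
from collections import Counter


def estimate_policy(observed_arms, n_arms):
    """Empirical-frequency estimate of a mediator's policy (illustrative).

    observed_arms is the list of arm indices the mediator has reported.
    Assumption: fall back to the uniform distribution before any observation.
    """
    total = len(observed_arms)
    if total == 0:
        return [1.0 / n_arms] * n_arms
    counts = Counter(observed_arms)
    return [counts[a] / total for a in range(n_arms)]


# Hypothetical mediator policy over three arms, hidden from the learner.
rng = random.Random(0)
true_policy = [0.7, 0.2, 0.1]
observed = rng.choices([0, 1, 2], weights=true_policy, k=5000)
estimate = estimate_policy(observed, n_arms=3)
```

Because the mediator reports its chosen arm at every round, each query yields one sample from its policy, so the estimate sharpens as that mediator is queried more often.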

Overall, our work presents a novel approach to best-arm identification by incorporating mediators’ feedback, providing insights into decision-making problems beyond the traditional framework.