We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and $O(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of follow-the-regularized-leader (FTRL) over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.
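To make the role of the self-bounding technique concrete, here is a minimal schematic of how such BOBW guarantees are typically derived from an FTRL analysis; the notation below ($q_t(\pi^\star)$ for the probability FTRL places on an optimal policy in episode $t$, $S_T = \sum_{t=1}^{T}(1 - q_t(\pi^\star))$, the minimum suboptimality gap $\Delta_{\min}$, and the constant $c$) is illustrative rather than taken from the paper. The FTRL analysis yields a bound of the form
\[
  \mathrm{Reg}_T \le O\bigl(\sqrt{c\, S_T}\bigr),
\]
which alone gives the adversarial $O(\sqrt{T})$ rate (up to the factors in $c$), since $S_T \le T$. In the stochastic regime one additionally has $\mathrm{Reg}_T \ge \Delta_{\min} S_T$; substituting this into the first bound and solving for $\mathrm{Reg}_T$ gives
\[
  \mathrm{Reg}_T \le O\bigl(\sqrt{c\, \mathrm{Reg}_T / \Delta_{\min}}\bigr)
  \;\Longrightarrow\;
  \mathrm{Reg}_T \le O\!\left(\frac{c}{\Delta_{\min}}\right),
\]
so any $c$ with logarithmic dependence on $T$ simultaneously certifies the $O(\log T)$ stochastic bound.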