We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect-information games (IIGs) with trajectory feedback. In this setting, players update their policies sequentially, based on their observations, over a fixed number of episodes $T$. Existing procedures suffer from high variance because they use importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we consider a fixed sampling approach, in which players still update their policies over time but obtain their observations through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability and has a near-optimal dependence on the game parameters when applied with the best theoretical choices of learning rates and sampling policies. To achieve these results, we generalize the notion of OMD stabilization, allowing for time-varying regularization with convex increments.
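To make the fixed sampling idea concrete, the following is a minimal sketch of a local OMD update at a single information set. It is not the algorithm analyzed in the paper: it assumes an entropic regularizer (so the OMD step becomes an exponential-weights update), a learning rate of the form `eta0 / sqrt(t)`, deterministic per-action losses in $[0, 1]$, and it omits the regularized loss. The names `omd_step` and `run_fixed_sampling` are hypothetical. The point it illustrates is that the importance weight `1 / sampling_policy[action]` depends only on the fixed sampling policy, so the loss estimates stay bounded however the learned policy evolves.

```python
import numpy as np


def omd_step(logits, loss_estimate, learning_rate):
    """One entropic OMD step, kept in log-space for numerical stability.

    Subtracting learning_rate * loss_estimate from the logits is equivalent
    to multiplying the current policy by exp(-learning_rate * loss_estimate)
    and renormalising (assumption: negative-entropy regularizer).
    """
    return logits - learning_rate * loss_estimate


def softmax(logits):
    """Map logits to a probability vector on the simplex."""
    shifted = logits - logits.max()  # avoid overflow in exp
    weights = np.exp(shifted)
    return weights / weights.sum()


def run_fixed_sampling(losses, sampling_policy, episodes, eta0=0.5, seed=0):
    """Learn a policy at one information set from a fixed sampling policy.

    losses: per-action loss in [0, 1] (illustrative stand-in for game payoffs).
    sampling_policy: fixed distribution used to pick the observed action.
    """
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    sampling_policy = np.asarray(sampling_policy, dtype=float)
    n_actions = len(losses)
    logits = np.zeros(n_actions)  # uniform initial policy

    for t in range(1, episodes + 1):
        # Actions are drawn from the fixed sampling policy, not the learned one.
        action = rng.choice(n_actions, p=sampling_policy)
        # Importance-weighted loss estimate; the weight is bounded by
        # 1 / min(sampling_policy) because the sampling policy never changes.
        estimate = np.zeros(n_actions)
        estimate[action] = losses[action] / sampling_policy[action]
        # Decreasing learning rate (assumed schedule, not the paper's choice).
        logits = omd_step(logits, estimate, eta0 / np.sqrt(t))

    return softmax(logits)


if __name__ == "__main__":
    policy = run_fixed_sampling([0.1, 0.5, 0.9], [1 / 3, 1 / 3, 1 / 3], episodes=3000)
    print(policy)
```

In a full game, one such update would run at every information set, each with its own learning rate, which is what "locally" and "individually decreasing" refer to in the description above.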