We consider online no-regret learning in unknown games with bandit feedback, where each player can only observe its reward at each time step -- determined by all players' current joint action -- rather than its gradient. We focus on the class of \textit{smooth and strongly monotone} games and study optimal no-regret learning therein. Leveraging self-concordant barrier functions, we first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$ under smooth and strongly concave reward functions ($n \geq 1$ is the problem dimension). We then show that if each player applies this no-regret learning algorithm in strongly monotone games, the joint action converges in the \textit{last iterate} to the unique Nash equilibrium at a rate of $\tilde{\Theta}(nT^{-1/2})$. Prior to our work, the best-known convergence rate for the same class of games was $\tilde{O}(n^{2/3}T^{-1/3})$ (achieved by a different algorithm), leaving open the problem of designing an optimal no-regret learning algorithm (the known lower bound being $\Omega(nT^{-1/2})$). Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretic learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves (up to log factors) both the optimal regret in single-agent learning and the optimal last-iterate convergence rate in multi-agent learning. We also present preliminary numerical results on several application problems to demonstrate the efficacy of our algorithm in terms of iteration count.