Online Learning and Equilibrium Computation with Ranking Feedback

Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a \emph{ranking} over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
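To make the feedback model concrete, the following is a minimal sketch of how a ranking could be sampled from a Plackett-Luce model with a temperature parameter. The abstract does not give the exact parametrization, so this assumes the common form in which the next-ranked action is drawn with probability proportional to exp(utility / temperature). The function names `sample_pl_ranking` and `running_average` are hypothetical and are not taken from the paper.

```python
import math
import random


def sample_pl_ranking(utilities, temperature, rng):
    """Sample a ranking (best first) from a Plackett-Luce model.

    Assumption: the next item is chosen among the remaining ones with
    probability proportional to exp(utility / temperature). A small
    temperature makes the ranking nearly deterministic.
    """
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(utilities[i] for i in remaining)
        weights = [math.exp((utilities[i] - top) / temperature) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def running_average(utility_sequence):
    """Average the utility vectors observed so far, one entry per action."""
    rounds = len(utility_sequence)
    num_actions = len(utility_sequence[0])
    return [sum(u[i] for u in utility_sequence) / rounds for i in range(num_actions)]


if __name__ == "__main__":
    rng = random.Random(0)
    history = [[0.1, 0.9, 0.5], [0.8, 0.2, 0.4]]  # illustrative utility vectors

    # Ranking induced by the instantaneous utility of the latest round.
    print(sample_pl_ranking(history[-1], temperature=0.5, rng=rng))

    # Ranking induced by the time-average utility up to the latest round.
    print(sample_pl_ranking(running_average(history), temperature=0.5, rng=rng))
```

In this sketch the learner would see only the returned index list, never the utility values themselves. The two calls in the usage section differ only in which utility vector induces the ranking, mirroring the instantaneous and time-average mechanisms described in the abstract.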

