We study the problem of full-information online learning in the "bounded recall" setting popular in the study of repeated games. An online learning algorithm $\mathcal{A}$ is $M$-$\textit{bounded-recall}$ if its output at time $t$ can be written as a function of the $M$ previous rewards (and not, e.g., of any other internal state of $\mathcal{A}$). We first demonstrate that a natural approach to constructing bounded-recall algorithms from mean-based no-regret learning algorithms (e.g., running Hedge over only the last $M$ rounds) fails: any such algorithm incurs constant regret per round. We then construct a stationary bounded-recall algorithm that achieves a per-round regret of $O(1/\sqrt{M})$, which we complement with a matching $\Omega(1/\sqrt{M})$ lower bound. Finally, we show that, unlike in the perfect-recall setting, any low-regret bounded-recall algorithm must be aware of the ordering of the past $M$ losses: any bounded-recall algorithm that plays a symmetric function of the past $M$ losses must incur constant regret per round.
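For reference, "per-round regret" can be read as average external regret over $K$ actions; this standard definition is an assumption of this note, since the abstract does not state it explicitly:
$$\frac{\mathrm{Regret}(T)}{T} \;=\; \frac{1}{T}\left(\max_{i \in [K]} \sum_{t=1}^{T} r_{t,i} \;-\; \sum_{t=1}^{T} \langle p_t, r_t \rangle\right),$$
where $r_t \in [0,1]^K$ is the reward vector revealed at time $t$ and $p_t$ is the algorithm's distribution over actions, which for an $M$-bounded-recall algorithm must be a function of $r_{t-M}, \dots, r_{t-1}$ alone.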
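To make the first negative result concrete, the following is a minimal sketch of the "windowed Hedge" construction the abstract refers to, assuming the standard exponential-weights form of Hedge; the names `hedge_last_m` and `eta`, the learning-rate choice, and the toy reward stream are illustrative and not taken from the paper. Because the played distribution depends only on the last $M$ reward vectors, this is $M$-bounded-recall by the definition above; the abstract's claim is that every mean-based algorithm of this form nevertheless incurs constant regret per round.

```python
import numpy as np

def hedge_last_m(reward_window, eta):
    # Hedge weights computed from only the most recent reward vectors:
    # the output is a function of the window alone, so the algorithm is
    # M-bounded-recall in the sense of the abstract.
    totals = reward_window.sum(axis=0)               # per-action reward over the window
    weights = np.exp(eta * (totals - totals.max()))  # subtract max for numerical stability
    return weights / weights.sum()                   # probability distribution over actions

# Toy run with K = 2 actions and recall M = 8; eta is a conventional
# Hedge learning-rate choice, not one prescribed by the paper.
rng = np.random.default_rng(0)
M, K, T = 8, 2, 100
eta = np.sqrt(np.log(K) / M)
history = []
for t in range(T):
    window = np.array(history[-M:]) if history else np.zeros((1, K))
    p = hedge_last_m(window, eta)      # distribution played at time t
    history.append(rng.random(K))      # stand-in rewards; adversarial in the real setting
```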