Online convex optimization (OCO) is a widely used framework in online learning. In each round, the learner chooses a decision in a convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their current decision. However, in many applications the learner's loss depends not only on the current decision but on the entire history of decisions until that point. The OCO framework and its existing generalizations do not capture this, and they can only be applied to many settings of interest after a long series of approximation arguments. They also leave open the question of whether the dependence on memory is tight because there are no non-trivial lower bounds. In this work we introduce a generalization of the OCO framework, ``Online Convex Optimization with Unbounded Memory'', that captures long-term dependence on past decisions. We introduce the notion of $p$-effective memory capacity, $H_p$, that quantifies the maximum influence of past decisions on present losses. We prove an $O(\sqrt{H_p T})$ upper bound on the policy regret and a matching (worst-case) lower bound. As a special case, we prove the first non-trivial lower bound for OCO with finite memory~\citep{anavaHM2015online}, which could be of independent interest, and also improve existing upper bounds. We demonstrate the broad applicability of our framework by using it to derive regret bounds, and to improve and simplify existing regret bound derivations, for a variety of online learning problems including online linear control and an online variant of performative prediction.