We revisit the classic problem of optimal subset selection in the online learning setup. Assume that the set $[N]$ consists of $N$ distinct elements. In round $t$, an adversary chooses a monotone reward function $f_t: 2^{[N]} \to \mathbb{R}_+$ that assigns a non-negative reward to each subset of $[N]$. An online policy selects (possibly at random) a subset $S_t \subseteq [N]$ of $k$ elements before the reward function $f_t$ for round $t$ is revealed to the learner. As a consequence of its choice, the policy receives a reward of $f_t(S_t)$ in round $t$. Our goal is to design an online sequential subset selection policy that maximizes the expected cumulative reward over a given time horizon. To this end, we propose an online learning policy called SCore (Subset Selection with Core) that solves the problem for a large class of reward functions. The SCore policy is based on a new polyhedral characterization of the reward functions called $\alpha$-Core, a generalization of the Core from the cooperative game theory literature. We establish a learning guarantee for the SCore policy in terms of a new performance metric called $\alpha$-augmented regret. In this metric, the performance of the online policy is compared with an unrestricted offline benchmark that can select all $N$ elements in every round. We show that a large class of reward functions, including submodular functions, can be efficiently optimized with the SCore policy. We also extend the proposed policy to the optimistic learning setup, where the learner has access to additional untrusted hints about the reward functions. Finally, we conclude the paper with a list of open problems.
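To make the interaction protocol concrete, the following is a minimal Python sketch of the round structure described above. It is not the SCore policy: the learner here is a placeholder that picks $k$ elements uniformly at random, and the names `coverage_reward`, `uniform_policy`, and `play` are hypothetical, introduced only for illustration. The coverage function stands in for one example of a monotone submodular reward; the second value returned by `play` is the cumulative reward of the unrestricted benchmark that selects all $N$ elements in every round.

```python
import random
from typing import Callable, FrozenSet, List, Set, Tuple

RewardFn = Callable[[FrozenSet[int]], float]


def coverage_reward(covers: List[Set[int]]) -> RewardFn:
    """Example monotone submodular reward (an assumption, not from the paper):
    the number of distinct items covered by the selected elements."""
    def f(subset: FrozenSet[int]) -> float:
        covered: Set[int] = set()
        for i in subset:
            covered |= covers[i]
        return float(len(covered))
    return f


def uniform_policy(n: int, k: int, rng: random.Random) -> FrozenSet[int]:
    """Placeholder learner: k elements of {0, ..., n-1} chosen uniformly at random."""
    return frozenset(rng.sample(range(n), k))


def play(n: int, k: int, reward_fns: List[RewardFn],
         rng: random.Random) -> Tuple[float, float]:
    """Run the protocol; return (learner's reward, reward of selecting all n elements)."""
    learner_total = 0.0
    full_set_total = 0.0
    everything = frozenset(range(n))
    for f_t in reward_fns:
        s_t = uniform_policy(n, k, rng)    # S_t is committed before f_t is revealed
        learner_total += f_t(s_t)          # reward f_t(S_t) collected in round t
        full_set_total += f_t(everything)  # unrestricted offline benchmark f_t([N])
    return learner_total, full_set_total
```

Because every $f_t$ is monotone, the learner's total can never exceed the full-set total; the $\alpha$-augmented regret quantifies how the two compare, with the exact scaling as defined in the paper.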