We consider online learning problems under a partial observability model that captures situations where the information conveyed to the learner lies between full information and bandit feedback. In the simplest variant, we assume that, in addition to its own loss, the learner also observes the losses of some other actions. The revealed losses depend on the learner's action and on a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner lies between semi-bandit and full feedback. As the predictions of our first algorithm cannot always be computed efficiently in this setting, we propose another algorithm with similar properties that is always computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient, both computationally and information-theoretically, than previously studied exploration strategies for the problem.
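To make the feedback model in the simplest variant concrete, the following is a minimal Python sketch of one way implicit exploration could be realized with exponential weights. The abstract does not state the estimator, so the formula used here (dividing each revealed loss by its observation probability plus a small bias `gamma`) is an assumption, and the names `ix_loss_estimates`, `run`, `eta`, and `gamma` are hypothetical. The observation system is represented as a list of sets, and it is consulted only after the action has been drawn.

```python
import math
import random


def ix_loss_estimates(probs, observes, played, losses, gamma):
    """Biased loss estimates for one round (assumed form, not taken from the text).

    probs[j]    : probability of playing action j in this round
    observes[j] : set of actions whose loss is revealed when j is played
                  (assumed here to contain j itself)
    played      : the action that was actually drawn
    losses      : losses in [0, 1]; only the revealed entries are read
    gamma       : positive bias parameter (hypothetical name)
    """
    n = len(probs)
    estimates = [0.0] * n
    for i in observes[played]:
        # Probability that the loss of action i is revealed in this round.
        o_i = sum(probs[j] for j in range(n) if i in observes[j])
        # Adding gamma to the denominator caps the estimate at 1 / gamma,
        # so no explicit uniform exploration is mixed into probs.
        estimates[i] = losses[i] / (o_i + gamma)
    return estimates


def run(loss_rounds, graph_rounds, eta, gamma, seed=0):
    """Exponential weights over cumulative estimated losses."""
    rng = random.Random(seed)
    n = len(loss_rounds[0])
    cumulative = [0.0] * n
    total_loss = 0.0
    for losses, observes in zip(loss_rounds, graph_rounds):
        # Subtract the minimum before exponentiating for numerical stability.
        low = min(cumulative)
        weights = [math.exp(-eta * (c - low)) for c in cumulative]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        played = rng.choices(range(n), weights=probs)[0]
        total_loss += losses[played]
        # The observation system is used only here, after the action is chosen.
        estimates = ix_loss_estimates(probs, observes, played, losses, gamma)
        cumulative = [c + e for c, e in zip(cumulative, estimates)]
    return total_loss, cumulative
```

As a usage example under the same assumptions, `run([[1.0, 0.0]] * 50, [[{0, 1}, {1}]] * 50, eta=0.1, gamma=0.05)` simulates 50 rounds with two actions, where playing action 0 also reveals the loss of action 1 but not the other way around.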