We develop a general theory for optimizing the frequentist regret in sequential learning problems, from which efficient bandit and reinforcement learning algorithms can be derived via unified Bayesian principles. We propose a novel optimization approach that generates "algorithmic beliefs" at each round and uses Bayesian posteriors to make decisions. The optimization objective that creates these algorithmic beliefs, which we term the "Algorithmic Information Ratio," is an intrinsic complexity measure that effectively characterizes the frequentist regret of any algorithm. To the best of our knowledge, this is the first systematic approach to make Bayesian-type algorithms prior-free and applicable to adversarial settings in a generic and optimal manner. Moreover, the resulting algorithms are simple and often efficient to implement. As a major application, we present a novel algorithm for multi-armed bandits that achieves "best-of-all-worlds" empirical performance in stochastic, adversarial, and non-stationary environments. We also illustrate how these principles apply to linear bandits, bandit convex optimization, and reinforcement learning.