Online learning algorithms often face a fundamental trilemma: achieving strong regret guarantees in adversarial settings, achieving strong guarantees in stochastic settings, and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy. Crucially, COMPASS-Hedge is parameter-free: it requires no prior knowledge of the environment's nature or of the magnitudes of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-worlds" guarantee in the full-information setting, establishing that baseline safety need not come at the cost of worst-case robustness or stochastic efficiency.
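To make the comparator-aware mixing idea concrete, the following is a minimal illustrative sketch, not the COMPASS-Hedge algorithm itself: a standard Hedge (exponential-weights) learner whose distribution is mixed toward a designated baseline expert with a small mixing weight `gamma`. The fixed learning rate `eta` and the specific mixing schedule are assumptions for illustration; the paper's method is parameter-free and uses adaptive, phase-based tuning instead.

```python
import numpy as np

def hedge_with_baseline_mixing(losses, baseline=0, gamma=0.01, eta=None):
    """Illustrative Hedge with comparator-aware mixing.

    losses   : (T, K) array of per-round losses in [0, 1] for K experts
    baseline : index of the designated baseline expert (hypothetical choice)
    gamma    : small mixing weight placed on the baseline each round
    eta      : learning rate; defaults to the standard minimax tuning
    Returns the learner's cumulative loss and each expert's cumulative loss.
    """
    T, K = losses.shape
    if eta is None:
        eta = np.sqrt(np.log(K) / T)  # classical sqrt(ln K / T) tuning
    cum_loss = np.zeros(K)
    learner_loss = 0.0
    for t in range(T):
        # Exponential weights, shifted by the min for numerical stability.
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        p = w / w.sum()
        # Comparator-aware mixing: keep gamma mass on the baseline expert,
        # which caps the regret against that baseline at roughly gamma * T
        # worth of extra loss while preserving Hedge's worst-case guarantee.
        p = (1.0 - gamma) * p + gamma * np.eye(K)[baseline]
        learner_loss += float(p @ losses[t])
        cum_loss += losses[t]
    return learner_loss, cum_loss
```

On a simple instance where the baseline coincides with the best expert (e.g., expert 0 always incurs loss 0 and expert 1 always incurs loss 1), the learner's cumulative loss stays within the usual $O(\sqrt{T \ln K})$ Hedge regret of the best expert.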