Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

Multi-objective bandits, in which the reward of each arm is a multi-dimensional vector rather than a scalar, have attracted increasing attention because of their broad applicability and mathematical elegance. Vector-valued rewards naturally introduce Pareto order relations and Pareto regret. A long-standing question in this area is whether this added complexity makes performance fundamentally harder to optimize. A recent, surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; in the stochastic setting, however, where the regret notion is different, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This subtle and contested phenomenon motivates our central question: \emph{are multi-objective bandits actually harder than single-objective ones?} We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap \(g^\dagger\), and hence by the minimum marginal regret, of order \(\Omega(\frac{K\log T}{g^\dagger})\). We further develop a new algorithm that achieves Pareto regret of order \(O(\frac{K\log T}{g^\dagger})\) and is therefore optimal. The algorithm uses a nested two-layer uncertainty quantification over both arms and objectives, built from upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection; together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments that validate the proposed algorithm, confirming the predicted regret guarantee and showing significant gains over benchmark methods.
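To make the Pareto order concrete, the following is a minimal Python sketch of Pareto dominance, the Pareto front, and a per-arm Pareto sub-optimality gap computed from known mean reward vectors. The function names (`dominates`, `pareto_front`, `pareto_gaps`) and the example means are illustrative assumptions, not taken from the paper. The gap used here is the commonly used definition (the smallest uniform shift that makes an arm non-dominated), which is also an assumption; it is not the quantity \(g^\dagger\), which the abstract does not define.

```python
import numpy as np


def dominates(u, v):
    """True if u Pareto-dominates v: u >= v in every objective, > in at least one."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u >= v) and np.any(u > v))


def pareto_front(means):
    """Indices of arms whose mean vector is not dominated by any other arm."""
    means = np.asarray(means, dtype=float)
    k = len(means)
    return [i for i in range(k)
            if not any(dominates(means[j], means[i]) for j in range(k) if j != i)]


def pareto_gaps(means):
    """Assumed gap definition: for arm i, the smallest eps >= 0 such that
    means[i] + eps (added to every objective) is no longer dominated."""
    means = np.asarray(means, dtype=float)
    # diff[j, i, d] = means[j, d] - means[i, d]
    diff = means[:, None, :] - means[None, :, :]
    # Smallest coordinate-wise advantage of arm j over arm i.
    worst_advantage = diff.min(axis=2)
    # Largest such advantage over all competitors j, floored at zero.
    return np.maximum(worst_advantage.max(axis=0), 0.0)


# Hypothetical example: K = 4 arms, 2 objectives.
example_means = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.4, 0.4]]
front = pareto_front(example_means)   # arms 0, 1, 2 are Pareto optimal
gaps = pareto_gaps(example_means)     # only arm 3 has a positive gap
```

In this example the fourth arm is dominated only by the third, so its gap is the uniform shift needed to reach that arm's means, while every arm on the front has gap zero.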
