We study Pareto optimality in multi-objective multi-armed bandits by formulating the adversarial multi-objective multi-armed bandit problem and defining Pareto regrets that apply to both stochastic and adversarial settings. These regrets do not rely on any scalarization function and reflect Pareto optimality more faithfully than scalarized regrets do. We also present new algorithms for the cases with and without prior information about the multi-objective multi-armed bandit setting. Our upper and lower bounds on the Pareto regrets show that the algorithms are simultaneously optimal in adversarial settings and nearly optimal, up to a logarithmic factor, in stochastic settings. Moreover, the lower-bound analyses show that the new regrets are consistent with the existing Pareto regret for stochastic settings, and they extend an adversarial attack mechanism from the standard bandit setting to the multi-objective one.
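To make the notion of Pareto optimality among arms concrete, the following is a minimal sketch, not the paper's algorithm or its regret definition. It assumes each arm has a known vector of mean rewards (one entry per objective) and that larger rewards are better; the names `dominates` and `pareto_optimal_arms` and the example values are hypothetical.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if u Pareto-dominates v (assumption: rewards are maximized).

    u dominates v when u is at least as good in every objective
    and strictly better in at least one.
    """
    at_least_as_good = all(a >= b for a, b in zip(u, v))
    strictly_better = any(a > b for a, b in zip(u, v))
    return at_least_as_good and strictly_better


def pareto_optimal_arms(means: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of arms whose mean-reward vector is not dominated by any other arm."""
    return [
        i
        for i, u in enumerate(means)
        if not any(dominates(v, u) for j, v in enumerate(means) if j != i)
    ]


# Hypothetical two-objective instance with four arms.
example_means = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]
front = pareto_optimal_arms(example_means)
print(front)  # arm 2 is dominated by arm 1, so it is excluded
```

No scalarization is involved here: arms 0, 1, and 3 are mutually incomparable, which is the situation a Pareto regret is meant to handle without collapsing the objectives into a single number.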