In this paper, we provide the first investigation into adaptive combinatorial experimental design, focusing on the trade-off between regret minimization and statistical power in combinatorial multi-armed bandits (CMAB). While minimizing regret requires repeated exploitation of high-reward arms, accurate inference on reward gaps requires sufficient exploration of suboptimal actions. We formalize this trade-off through the concept of Pareto optimality and establish equivalent conditions for Pareto-efficient learning in CMAB. We consider two information structures, full-bandit feedback and semi-bandit feedback, and propose one algorithm for each: MixCombKL for the full-bandit case and MixCombUCB for the semi-bandit case. We show that both algorithms are Pareto optimal, with finite-time guarantees on both regret and the estimation error of arm gaps. Our results further reveal that richer feedback significantly tightens the attainable Pareto frontier, with the primary gains arising from improved estimation accuracy under our proposed methods. Taken together, these findings establish a principled framework for adaptive combinatorial experimentation in multi-objective decision-making.