We consider the problem of reward maximization in the dueling bandit setting under constraints on resource consumption. As in classic dueling bandits, in each round the learner chooses a pair of items from a set of $K$ items and observes relative feedback for that pair. In addition, the learner observes a vector of resource consumptions for each of the two items. The learner's objective is to maximize the cumulative reward while ensuring that the total consumption of every resource stays within the allocated budget. We show that, because the feedback is relative, the problem is harder than its bandit counterpart, and that without further assumptions it is not learnable from a regret minimization perspective. We then exploit assumptions on the available budget to obtain an EXP3-based dueling algorithm that also accounts for the associated consumptions, and we show that it achieves $\tilde{\mathcal{O}}\left(\frac{OPT^{(b)}}{B}K^{1/3}T^{2/3}\right)$ regret, where $OPT^{(b)}$ is the optimal value and $B$ is the available budget. Finally, we provide numerical simulations that demonstrate the efficacy of the proposed method.
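The interaction protocol described in the abstract can be sketched as a small simulation. This is an illustration of the setting only, not the paper's EXP3-based algorithm. The function name `simulate_constrained_duel`, the uniformly random pair selection, the deterministic per-item consumption vectors, the single budget shared by all resources, and the rule of stopping before any resource would exceed the budget are all assumptions made for the example.

```python
import random


def simulate_constrained_duel(pref, cons, budget, horizon, seed=0):
    """Simulate dueling-bandit rounds under a resource budget.

    pref[i][j] : probability that item i beats item j (hypothetical input)
    cons[i]    : per-round consumption vector of item i (assumed deterministic)
    budget     : budget applied to every resource (assumed shared)
    horizon    : maximum number of rounds T
    """
    rng = random.Random(seed)
    k = len(pref)
    d = len(cons[0])
    used = [0.0] * d
    history = []
    for _ in range(horizon):
        # Placeholder policy: pick the pair uniformly at random.
        i, j = rng.randrange(k), rng.randrange(k)
        # Both chosen items consume resources in this round.
        step = [cons[i][r] + cons[j][r] for r in range(d)]
        # Stop before any resource would exceed its budget.
        if any(used[r] + step[r] > budget for r in range(d)):
            break
        # Relative feedback: only the outcome of the duel is observed.
        i_wins = rng.random() < pref[i][j]
        for r in range(d):
            used[r] += step[r]
        history.append((i, j, i_wins))
    return history, used


if __name__ == "__main__":
    pref = [[0.5, 0.7, 0.8], [0.3, 0.5, 0.6], [0.2, 0.4, 0.5]]
    cons = [[0.5, 0.25], [0.25, 0.5], [0.5, 0.5]]
    history, used = simulate_constrained_duel(pref, cons, budget=10, horizon=100)
    print(len(history), used)
```

A learning algorithm would replace the random pair selection with a policy that uses both the duel outcomes and the observed consumptions; the budget check shows why the number of rounds actually played can be much smaller than $T$.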