Pure exploration in bandits formalises multiple real-world problems, such as tuning hyper-parameters or conducting user studies to test a set of items, where safety, resource, and fairness constraints on the decision space naturally appear. We study these problems as pure exploration in multi-armed bandits with unknown linear constraints, where the aim is to identify an $r$-optimal and feasible policy as fast as possible with a given level of confidence. First, we propose a Lagrangian relaxation of the sample-complexity lower bound for pure exploration under constraints. Second, we leverage the convex-optimisation properties of this Lagrangian lower bound to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX. Then, we propose a constraint-adaptive stopping rule and, while tracking the lower bound, use optimistic estimates of the feasible set at each step. We show that LAGEX achieves an asymptotically optimal sample-complexity upper bound, while LATS is asymptotically optimal up to novel constraint-dependent constants. Finally, we conduct numerical experiments with different reward distributions and constraints that validate the efficiency of LATS and LAGEX.