We introduce the safe best-arm identification framework with linear feedback, in which the agent is subject to a stage-wise safety constraint that depends linearly on an unknown parameter vector. The agent must act conservatively so as to ensure that, with high probability, the safety constraint is satisfied at each round. Ways of leveraging the linear structure to ensure safety have been studied for regret minimization but, to the best of our knowledge, not for best-arm identification. We propose a gap-based algorithm that achieves meaningful sample complexity while ensuring stage-wise safety. We show that the sample complexity incurs an extra term due to the forced-exploration phase required by the additional safety constraint. Experimental illustrations are provided to justify the design of our algorithm.
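To make the idea of acting conservatively under a linear safety constraint concrete, the following is a minimal sketch and not the algorithm proposed here. It assumes, purely for illustration, that the constraint takes the form "a^T mu <= threshold" for an unknown vector mu, that mu_hat is a regularized least-squares estimate with design matrix V, and that beta is a confidence radius; the function name safe_arms and all of its parameters are hypothetical.

```python
import numpy as np

def safe_arms(arms, V, mu_hat, beta, threshold):
    """Return indices of arms that are safe under a pessimistic estimate.

    Assumed (hypothetical) setup: arm a is safe if a^T mu <= threshold,
    and mu lies in the ellipsoid {mu : ||mu - mu_hat||_V <= beta}.
    """
    V_inv = np.linalg.inv(V)
    # Confidence width per arm: beta * ||a||_{V^{-1}}
    widths = beta * np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    # Upper confidence bound on the safety value a^T mu
    upper = arms @ mu_hat + widths
    # Keep only arms whose worst-case safety value meets the threshold
    return np.flatnonzero(upper <= threshold)

# Illustrative inputs (made-up numbers, not experimental data)
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = np.eye(2)
mu_hat = np.array([0.5, 0.0])
print(safe_arms(arms, V, mu_hat, beta=0.1, threshold=0.3))
```

Under these assumptions, a smaller confidence radius beta certifies more arms as safe, which gives one intuition for why an initial exploration phase restricted to known-safe actions can add to the sample complexity.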