Safe Linear Bandits over Unknown Polytopes

The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive doubly-optimistic play in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due to the lack of knowledge of constraints: there exist `easy' instances, for which suboptimal extreme points have large `gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, DOSS, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, DOSS simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\tilde O(\sqrt{T})$ bounds on safety violations. Further, when safety is demanded only to a finite precision, the bound on violations improves to $O(\log^2 T)$. These results rely on a novel dual analysis of linear bandits: we argue that DOSS proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a `poor' set of constraints is activated, and rounds where `good' sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.
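
For concreteness, the quantities named above can be sketched as follows. The notation here is an assumption made for illustration and is not fixed by the abstract: $\theta \in \mathbb{R}^d$ is the unknown objective, $A \in \mathbb{R}^{m \times d}$ (with rows $a_i^\top$) the unknown constraint matrix, $\alpha \in \mathbb{R}^m$ the constraint levels, $\mathcal{X}$ a known action domain, and $\mathcal{C}_t^{\theta}$, $\mathcal{C}_t^{A}$ hypothetical confidence sets built from the feedback observed before round $t$. Under these assumptions, the benchmark optimum and one natural form of doubly-optimistic play are
\[
x^* \in \arg\max_{x \in \mathcal{X}} \bigl\{ \theta^\top x : A x \le \alpha \bigr\},
\qquad
x_t \in \arg\max_{x \in \mathcal{X}} \; \max_{\tilde\theta \in \mathcal{C}_t^{\theta},\ \tilde A \in \mathcal{C}_t^{A}} \bigl\{ \tilde\theta^\top x : \tilde A x \le \alpha \bigr\},
\]
and plausible forms of the efficacy regret and the safety violation over $T$ rounds are
\[
\mathcal{R}_T = \sum_{t=1}^{T} \bigl( \theta^\top x^* - \theta^\top x_t \bigr)_+,
\qquad
\mathcal{S}_T = \sum_{t=1}^{T} \max_{1 \le i \le m} \bigl( a_i^\top x_t - \alpha_i \bigr)_+ .
\]
Whenever $\theta \in \mathcal{C}_t^{\theta}$ and $A \in \mathcal{C}_t^{A}$, the true problem is one of the candidates in the inner maximisation, so the optimistic value is at least $\theta^\top x^*$. The action $x_t$ itself may nonetheless violate the true constraints, which is why violations are tracked separately from regret.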
