The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown round-wise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study aggressive \emph{doubly-optimistic play} in SLBs, and its role in avoiding the strong assumptions and poor efficacy associated with extant pessimistic-optimistic solutions. We first elucidate an inherent hardness in SLBs due to the lack of knowledge of constraints: there exist `easy' instances, for which suboptimal extreme points have large `gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret and safety violations, because they cannot refine the location of optimal actions to arbitrary precision. In a positive direction, we propose and analyse a doubly-optimistic, confidence-bound-based strategy for the safe linear bandit problem, DOSLB, which exploits supreme optimism by using optimistic estimates of both reward and safety risks to select actions. Using a novel dual analysis, we show that despite the lack of knowledge of constraints, DOSLB rarely takes overly risky actions, and obtains tight instance-dependent $O(\log^2 T)$ bounds on both efficacy regret and net safety violations up to any finite precision, thus yielding large efficacy gains at a small safety cost and without strong assumptions. Concretely, we argue that the algorithm activates noisy versions of an `optimal' set of constraints at each round, and that the activation of suboptimal sets of constraints is limited by the larger of a safety gap and an efficacy gap that we define.
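To make the idea of doubly-optimistic selection concrete, the following is a minimal sketch and not the DOSLB algorithm itself. It assumes a finite list of candidate actions, a single safety constraint with a known threshold `alpha`, and a shared confidence radius `beta`. The function name `doubly_optimistic_action` and its parameters `theta_hat`, `a_hat`, and `v_inv` (estimated reward vector, estimated risk vector, and inverse of the regularised design matrix) are hypothetical names introduced for illustration. An action is treated as plausibly safe if its lower confidence bound on risk does not exceed `alpha`, and among such actions the one with the largest upper confidence bound on reward is chosen.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def doubly_optimistic_action(actions, theta_hat, a_hat, v_inv, beta, alpha):
    """Return the index of the selected action, or None if no action looks safe.

    Illustrative assumptions (not taken from the source):
      - actions is a finite list of d-dimensional tuples,
      - there is one safety constraint, a^T x <= alpha,
      - the same confidence radius beta is used for reward and risk.
    """
    best_index, best_value = None, -math.inf
    d = len(theta_hat)
    for i, x in enumerate(actions):
        # Confidence width: beta * sqrt(x^T V^{-1} x).
        vx = [sum(v_inv[r][c] * x[c] for c in range(d)) for r in range(d)]
        width = beta * math.sqrt(dot(x, vx))

        reward_ucb = dot(theta_hat, x) + width  # optimistic reward estimate
        risk_lcb = dot(a_hat, x) - width        # optimistic (low) risk estimate

        # Optimism about safety: keep x if it is plausibly safe.
        if risk_lcb <= alpha and reward_ucb > best_value:
            best_index, best_value = i, reward_ucb
    return best_index
```

In this sketch, when uncertainty is large (a large `v_inv`), an action whose estimated risk exceeds `alpha` can still be selected because it is plausibly safe. As the confidence widths shrink, such actions are ruled out, which loosely mirrors the claim that overly risky actions are taken only rarely.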