Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, in addition to minimizing regret, satisfy a safety constraint: its performance must not be worse than that of a baseline policy (e.g., the policy that the company has in production) by more than a $(1+\alpha)$ factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. In practice, however, the cost of the arms is often a non-linear function, and existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, $\mathtt{C-SquareCB}$ and $\mathtt{C-FastCB}$, that use Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of $\mathtt{C-SquareCB}$ is sub-linear in the horizon $T$, while the regret of $\mathtt{C-FastCB}$ is first-order and sub-linear in $L^*$, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide $\tilde{O}(\sqrt{KT} + K/\alpha)$ and $\tilde{O}(\sqrt{KL^*} + K (1 + 1/\alpha))$ regret bounds for $\mathtt{C-SquareCB}$ and $\mathtt{C-FastCB}$, respectively. Finally, we demonstrate the efficacy of our algorithms on real-world data and show that they significantly outperform the existing baseline while maintaining the performance guarantee.
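As a rough illustration of the Inverse Gap Weighting exploration mentioned above, the following minimal sketch computes the standard IGW sampling distribution over $K$ arms from predicted losses, as used in SquareCB-style algorithms. It is not necessarily the exact rule used by $\mathtt{C-SquareCB}$ or $\mathtt{C-FastCB}$: the function name `igw_probabilities`, the parameter `gamma` (an exploration rate), and the example loss values are assumptions made for illustration, and the conservative safety check against the baseline is not shown.

```python
from typing import List, Sequence


def igw_probabilities(predicted_losses: Sequence[float], gamma: float) -> List[float]:
    """Standard Inverse Gap Weighting over arms (lower predicted loss is better).

    Hypothetical helper for illustration; `gamma` >= 0 controls how strongly
    the distribution concentrates on the arm with the lowest predicted loss.
    """
    k = len(predicted_losses)
    # Greedy arm: the one the regression oracle currently predicts to be best.
    best = min(range(k), key=lambda a: predicted_losses[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            gap = predicted_losses[a] - predicted_losses[best]
            # Probability shrinks as the predicted gap to the best arm grows.
            probs[a] = 1.0 / (k + gamma * gap)
    # The greedy arm receives all remaining probability mass (at least 1/k).
    probs[best] = 1.0 - sum(probs)
    return probs


if __name__ == "__main__":
    # Illustrative predicted losses for three arms (made-up values).
    print(igw_probabilities([0.2, 0.5, 0.9], gamma=10.0))
```

With `gamma = 0` the distribution is uniform, and larger values of `gamma` shift mass toward the arm with the lowest predicted loss, which is the exploration-exploitation trade-off that IGW-based methods tune.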