We consider a safe optimization problem with bandit feedback in which an agent sequentially chooses actions and observes responses from the environment, with the goal of maximizing an arbitrary function of the response while respecting stage-wise constraints. We propose an algorithm for this problem and study how the geometric properties of the constraint set affect its regret. To do so, we introduce the notion of the sharpness of a constraint set, which characterizes the difficulty of learning within that set under uncertainty. This notion allows us to identify the class of constraint sets for which the proposed algorithm is guaranteed to enjoy sublinear regret. Simulation results support the sublinear regret bound and provide empirical evidence that the sharpness of the constraint set affects the algorithm's performance.