We introduce a novel multi-armed bandit framework in which each arm is associated with a fixed, unknown credal set over the space of outcomes (which can be richer than the reward alone). The arm-to-credal-set correspondence comes from a known class of hypotheses. We then define a notion of regret corresponding to the lower prevision defined by these credal sets. Equivalently, the setting can be regarded as a two-player zero-sum game in which, on each round, the agent chooses an arm and the adversary chooses a distribution over outcomes from a set of options associated with that arm. The regret is defined with respect to the value of the game. For certain natural hypothesis classes, loosely analogous to stochastic linear bandits (which are a special case of the resulting setting), we propose an algorithm and prove a corresponding upper bound on regret. We also prove lower bounds on regret for particular special cases.
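To make the game-theoretic reading concrete, the following is a minimal sketch and not the algorithm proposed here. It assumes a finite outcome space, a known reward per outcome, and credal sets given as convex hulls of finitely many distributions; the names `lower_prevision`, `game_value`, and the two-arm instance are hypothetical illustrations. Under these assumptions, the lower prevision of an arm is its worst-case expected reward, and because the adversary responds to the chosen arm, the value of the game is the largest lower prevision over arms.

```python
import numpy as np


def lower_prevision(extreme_points, reward):
    """Worst-case expected reward over a credal set.

    Assumption: the credal set is the convex hull of the rows of
    `extreme_points`, each row a distribution over finitely many outcomes.
    A linear functional attains its minimum over a convex hull at an
    extreme point, so checking the rows is enough.
    """
    extreme_points = np.asarray(extreme_points, dtype=float)
    reward = np.asarray(reward, dtype=float)
    return float(np.min(extreme_points @ reward))


def game_value(credal_sets, reward):
    """Maximin value: the best arm against a worst-case adversary."""
    return max(lower_prevision(c, reward) for c in credal_sets)


# Hypothetical instance: two outcomes with rewards 0 and 1.
reward = [0.0, 1.0]
credal_sets = [
    [[0.5, 0.5], [0.25, 0.75]],  # arm 0: P(reward = 1) lies in [0.5, 0.75]
    [[0.75, 0.25], [0.0, 1.0]],  # arm 1: P(reward = 1) lies in [0.25, 1.0]
]

value = game_value(credal_sets, reward)
# Per-round shortfall of each arm relative to the value of the game.
gaps = [value - lower_prevision(c, reward) for c in credal_sets]
print(value, gaps)
```

In this toy instance arm 1 can achieve a higher expected reward in the best case, but arm 0 has the better guarantee, so a learner judged against the value of the game should converge to arm 0.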