We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.
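To make the geometric idea concrete, the following is a minimal sketch of a Frank-Wolfe-style scalarization loop, not the algorithm from the paper. It assumes the expected reward vectors are known (so there is no exploration and no scalar bandit learner), and it uses a hypothetical concave objective f(z) = sum_i log(z_i + eps). The names `grad_log_welfare`, `frank_wolfe_scalarization`, and `mean_rewards` are illustrative only. At each round the gradient of f at the running average reward serves as the weights of a scalar reward, and the action maximizing that scalar reward is played.

```python
import numpy as np


def grad_log_welfare(z, eps=1e-3):
    """Gradient of the (assumed) concave objective f(z) = sum_i log(z_i + eps)."""
    return 1.0 / (z + eps)


def frank_wolfe_scalarization(mean_rewards, contexts, grad_f):
    """Greedy Frank-Wolfe-style loop with known expected rewards.

    mean_rewards[x, a] is the expected reward vector of action a in context x
    (assumed known here; a real bandit algorithm would have to estimate it).
    Returns the running average of the reward vectors of the chosen actions.
    """
    dim = mean_rewards.shape[-1]
    z = np.zeros(dim)  # running average reward vector
    for t, x in enumerate(contexts, start=1):
        weights = grad_f(z)                        # linearize f at the current average
        scalar_rewards = mean_rewards[x] @ weights  # one scalar reward per action
        a = int(np.argmax(scalar_rewards))          # greedy step on the scalar problem
        z += (mean_rewards[x, a] - z) / t           # incremental average update
    return z


# Toy instance: one context, two actions, each rewarding a different objective.
mean_rewards = np.array([[[1.0, 0.0], [0.0, 1.0]]])
contexts = np.zeros(1000, dtype=int)
z_final = frank_wolfe_scalarization(mean_rewards, contexts, grad_log_welfare)
print(z_final)
```

In this toy instance the objective is symmetric in the two reward coordinates, so the loop alternates between the two actions and the average reward settles near the balanced point (0.5, 0.5). In the setting of the abstract, the greedy `argmax` step would be replaced by a scalar-reward contextual bandit algorithm that learns the rewards from feedback.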