We consider contextual bandits with knapsacks (CBwK), a problem in which, at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative reward while ensuring that the cumulative costs stay below some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated, which is a typical assumption in the literature. In this setting, total-cost constraints previously had to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were even typically assumed to depend linearly on $T$. We are, however, motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, which are of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates that can handle total-cost constraints of order $\sqrt{T}$, up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive tuning of the step size.
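To make the idea of a dual strategy with projected-gradient-descent updates more concrete, here is a minimal Python sketch of one generic round of such a scheme. It is an illustration under stated assumptions, not the strategy analyzed in this work: the function names (`choose_action`, `dual_step`), the finite action set, the per-round budget vector, and the fixed `step_size` are all hypothetical choices made for the example, and the adaptive step-size tuning mentioned above is not reproduced.

```python
from typing import List, Sequence


def choose_action(
    reward_est: Sequence[float],
    cost_est: Sequence[Sequence[float]],
    lam: Sequence[float],
    per_round_budget: Sequence[float],
) -> int:
    """Return the action with the largest estimated Lagrangian payoff.

    reward_est[a] and cost_est[a] are (assumed given) estimates of the
    expected reward and cost vector of action a in the current context.
    """
    def payoff(a: int) -> float:
        # Penalize estimated costs in excess of the per-round budget,
        # weighted by the current dual variables.
        penalty = sum(
            l * (c - b) for l, c, b in zip(lam, cost_est[a], per_round_budget)
        )
        return reward_est[a] - penalty

    return max(range(len(reward_est)), key=payoff)


def dual_step(
    lam: Sequence[float],
    observed_cost: Sequence[float],
    per_round_budget: Sequence[float],
    step_size: float,
) -> List[float]:
    """One projected-gradient update of the dual variables.

    Each coordinate moves up when the cost exceeds the per-round budget and
    down otherwise, then is projected onto the nonnegative half-line.
    step_size is fixed here (an assumption of this sketch).
    """
    return [
        max(0.0, l + step_size * (c - b))
        for l, c, b in zip(lam, observed_cost, per_round_budget)
    ]
```

In this sketch, a large dual variable steers the learner toward actions whose estimated cost falls below the per-round budget, while a dual variable at zero leaves the reward estimate as the only criterion. Signed costs are handled naturally, since the projection only acts on the dual variables and not on the costs themselves.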