We consider contextual bandits with knapsacks (CBwK), a problem where, at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative reward while ensuring that the cumulative costs stay below predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated, a typical assumption in the literature. In this setting, total-cost constraints so far had to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were typically even assumed to depend linearly on $T$. We are, however, motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, which are of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates that can handle total-cost constraints of order $\sqrt{T}$ up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive tuning of the step size.
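
To make the idea of a dual strategy with projected-gradient-descent updates concrete, the following is a minimal sketch and not the strategy analyzed in the paper. It assumes a finite action set, a fixed step size (the adaptive tuning mentioned above is not reproduced), and a common per-round budget `budget_per_round` for every cost component. The function names `choose_action` and `dual_step`, and all parameter names, are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence


def choose_action(reward_est: Sequence[float],
                  cost_est: Sequence[Sequence[float]],
                  lam: Sequence[float],
                  budget_per_round: float) -> int:
    """Pick the action maximizing the estimated reward minus the dual penalty.

    reward_est[a] and cost_est[a] are (assumed) estimates of the expected
    reward and cost vector of action a for the current context.
    """
    def score(a: int) -> float:
        # Lagrangian penalty: dual variables times estimated budget overshoot.
        penalty = sum(l * (c - budget_per_round)
                      for l, c in zip(lam, cost_est[a]))
        return reward_est[a] - penalty

    return max(range(len(reward_est)), key=score)


def dual_step(lam: Sequence[float],
              cost: Sequence[float],
              budget_per_round: float,
              step: float) -> List[float]:
    """One projected-gradient step on the dual variables.

    Each component moves along the observed constraint violation
    (cost minus per-round budget) and is then projected onto [0, +inf).
    """
    return [max(0.0, l + step * (c - budget_per_round))
            for l, c in zip(lam, cost)]
```

In a full loop, one would call `choose_action` with the current estimates, observe the reward and cost vector of the chosen action, and then call `dual_step` with the observed costs. Because costs can be signed, the projection onto nonnegative values is what keeps the dual variables valid when a component of the cost falls below the per-round budget.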