We study the stochastic contextual bandit with knapsacks (CBwK) problem, in which each action, taken in response to a context, yields a random reward and also incurs a random vector of resource consumption. The challenge is to maximize the total reward without exceeding the budget of any resource. We study this problem under a general realizability setting, where the expected reward and expected cost are functions of the context and action that belong to given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing works on CBwK are restricted to linear function classes because they use UCB-type algorithms, which rely heavily on the linear form and are therefore difficult to extend to general function classes. Motivated by online regression oracles, which have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish a regret lower bound showing that our algorithm is optimal for a variety of function classes.
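For concreteness, the setting described above can be sketched in the following standard form. This is an illustrative formalization, not notation taken from the abstract: the horizon $T$, the number of resources $d$, the common per-resource budget $B$, the context $x_t$, action $a_t$, reward $r_t$, cost vector $\mathbf{c}_t$, the true functions $f^*$ and $g^*$, and the stopping time $\tau$ are all assumed names, as is the normalization of rewards and costs to $[0,1]$. In each round $t$, the learner observes $x_t$, chooses $a_t$, and receives $r_t \in [0,1]$ and $\mathbf{c}_t \in [0,1]^d$, where realizability means

$$
\mathbb{E}[r_t \mid x_t, a_t] = f^*(x_t, a_t), \quad f^* \in \mathcal{F}, \qquad
\mathbb{E}[\mathbf{c}_t \mid x_t, a_t] = g^*(x_t, a_t), \quad g^* \in \mathcal{G}.
$$

Under these assumptions, the objective is

$$
\max \; \mathbb{E}\!\left[\sum_{t=1}^{\tau} r_t\right]
\quad \text{subject to} \quad
\sum_{t=1}^{\tau} c_{t,j} \le B \quad \text{for every resource } j = 1, \dots, d,
$$

where $\tau \le T$ is the last round played before any resource budget would be exceeded.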