We consider the linear contextual multi-class multi-period packing problem (LMMP), where the goal is to pack items so that the total consumption vector stays within a given budget vector and the total value is as large as possible. We study the setting in which the reward and the consumption vector associated with each action are class-dependent linear functions of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We propose a new estimator that guarantees a faster convergence rate and, consequently, lower regret in such problems. We also propose a bandit policy that is a closed-form function of the estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon $T$, provided the budget grows at least as fast as $\sqrt{T}$. We also resolve an open problem posed by Agrawal & Devanur (2016) and extend the result to a multi-class setting. In our numerical experiments, the proposed policy outperforms other benchmarks from the literature.