We consider the linear contextual multi-class multi-period packing problem~(LMMP), where the goal is to pack items so that the total consumption vector does not exceed a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action are class-dependent linear functions of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new, more efficient estimator that guarantees a faster convergence rate and, consequently, lower regret in such problems. We propose a bandit policy that is a closed-form function of the estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon~$T$, provided the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016) and extend the result to a multi-class setting. Our numerical experiments demonstrate that our policy outperforms other benchmarks in the literature.