Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits

We study the problem of dynamic batch learning in high-dimensional sparse linear contextual bandits, where a decision maker can adapt decisions only at the batch level. In particular, the decision maker observes rewards only at the end of each batch and, at that point, dynamically decides how many individuals to include in the next batch and what personalized action-selection scheme to adopt within it. Such batch constraints are ubiquitous in practice, including personalized product offerings in marketing and medical treatment selection in clinical trials. We characterize the fundamental learning limit of this problem via a novel lower bound analysis and provide a simple, exploration-free algorithm based on the LASSO estimator that achieves the minimax optimal performance characterized by the lower bound (up to log factors). To the best of our knowledge, our work provides the first inroad into a rigorous understanding of dynamic batch learning with high-dimensional covariates. We also demonstrate the efficacy of our algorithm on both synthetic data and the Warfarin medical dosing data. The empirical results show that with three batches (hence only two opportunities to adapt), our algorithm already performs comparably, in terms of statistical performance, to the state-of-the-art fully online high-dimensional linear contextual bandit algorithm. As an added bonus, since our algorithm operates in batches, it is orders of magnitude faster than fully online learning algorithms. As such, our algorithm is a desirable candidate for practical data-driven personalized decision-making problems, where limited adaptivity is often a hard constraint.
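
As a rough illustration of the batch protocol described above, the following Python sketch runs a greedy (exploration-free) policy whose LASSO estimate is refreshed only at batch ends. It should not be read as the paper's algorithm: the fixed batch grid, the regularization schedule `lam0 * sqrt(log(d) / n)`, the Gaussian contexts, the coordinate-descent solver, and all function names are assumptions made for illustration, and the dynamic batch-size rule is not implemented.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(z, t):
    # Proximal operator of t * |.|
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/(2n)) * ||y - X theta||^2 + lam * ||theta||_1."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    resid = list(y)  # residual y - X theta, kept up to date
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (theta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * theta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                theta[j] = new
    return theta


def run_batched_greedy(batch_ends, d, K, theta_star, lam0, rng, noise_sd=0.1):
    """Greedy action selection with the estimate updated only at batch ends.

    batch_ends is a fixed, hypothetical grid (an assumption of this sketch).
    """
    theta_hat = [0.0] * d
    X_hist, y_hist = [], []
    regret, start = 0.0, 0
    for end in batch_ends:
        X_batch, y_batch = [], []
        for _ in range(start, end):
            # One context vector per action (assumed i.i.d. standard Gaussian).
            contexts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
            means = [dot(x, theta_star) for x in contexts]
            # Exploration-free choice: act greedily on the current estimate.
            a = max(range(K), key=lambda k: dot(contexts[k], theta_hat))
            regret += max(means) - means[a]
            X_batch.append(contexts[a])
            y_batch.append(means[a] + rng.gauss(0.0, noise_sd))
        # Rewards of the batch become available only here.
        X_hist.extend(X_batch)
        y_hist.extend(y_batch)
        lam = lam0 * math.sqrt(math.log(d) / len(y_hist))  # assumed schedule
        theta_hat = lasso_cd(X_hist, y_hist, lam)
        start = end
    return theta_hat, regret


if __name__ == "__main__":
    rng = random.Random(0)
    d, K = 20, 2
    theta_star = [1.0, -1.0] + [0.0] * (d - 2)  # 2-sparse toy parameter
    theta_hat, regret = run_batched_greedy([30, 100, 300], d, K, theta_star, 0.1, rng)
    print("estimated support:", [j for j, v in enumerate(theta_hat) if abs(v) > 0.05])
    print("cumulative regret:", round(regret, 2))
```

With three entries in `batch_ends`, the policy can change only twice, which mirrors the three-batch setting mentioned in the abstract. The toy dimensions are kept small so that the pure-Python solver runs quickly.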
