Multitask Learning and Bandits via Robust Statistics

Decision-makers often simultaneously face many related but heterogeneous learning problems. For instance, a large retailer may wish to learn product demand at different stores to solve pricing or inventory problems, making it desirable to learn jointly for stores serving similar customers; alternatively, a hospital network may wish to learn patient risk at different providers to allocate personalized interventions, making it desirable to learn jointly for hospitals serving similar patient populations. Motivated by real datasets, we study a natural setting where the unknown parameter in each learning instance can be decomposed into a shared global parameter plus a sparse instance-specific term. We propose a novel two-stage multitask learning estimator that exploits this structure in a sample-efficient way, using a unique combination of robust statistics (to learn across similar instances) and LASSO regression (to debias the results). Our estimator yields improved sample complexity bounds in the feature dimension $d$ relative to commonly employed estimators; this improvement is exponential for "data-poor" instances, which benefit the most from multitask learning. We illustrate the utility of these results for online learning by embedding our multitask estimator within simultaneous contextual bandit algorithms. We specify a dynamic calibration of our estimator to appropriately balance the bias-variance tradeoff over time, improving the resulting regret bounds in the context dimension $d$. Finally, we illustrate the value of our approach on synthetic and real datasets.
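
The two-stage idea described above can be illustrated with a minimal sketch. It is not the paper's exact estimator: the abstract does not specify which robust statistic is used, so the coordinate-wise median below is an assumption, as are the function names (`two_stage_estimate`, `lasso_ista`), the penalty `lam`, and the plain proximal-gradient LASSO solver. The sketch assumes a linear model in which each instance's parameter equals a shared vector plus a sparse instance-specific deviation.

```python
import numpy as np


def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    # Step size = 1 / Lipschitz constant of the smooth part's gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b


def two_stage_estimate(datasets, lam):
    """Hypothetical two-stage multitask estimator for a list of (X, y) pairs.

    Stage 1: fit ordinary least squares per instance, then combine the fits
             with a coordinate-wise median (assumed robust statistic) to
             estimate the shared parameter.
    Stage 2: per instance, run LASSO on the residuals to recover the sparse
             instance-specific deviation and correct the shared estimate.
    """
    ols_fits = np.array(
        [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in datasets]
    )
    shared = np.median(ols_fits, axis=0)

    estimates = []
    for X, y in datasets:
        deviation = lasso_ista(X, y - X @ shared, lam)
        estimates.append(shared + deviation)
    return shared, np.array(estimates)
```

Calling `two_stage_estimate(datasets, lam=0.05)` on a list of `(X, y)` arrays returns the shared estimate and one corrected parameter vector per instance. The median is chosen here because a sparse deviation affects only a few instances in any given coordinate, so those instances act like outliers that a robust aggregate can ignore; the value of `lam` is illustrative and would need tuning in practice.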
