Off-policy learning (OPL) aims to find improved policies from logged bandit data, often by minimizing the inverse propensity scoring (IPS) estimator of the risk. In this work, we investigate a smooth regularization for IPS, for which we derive a two-sided PAC-Bayes generalization bound. The bound is tractable, scalable, and interpretable, and it provides learning certificates. In particular, it also holds for standard IPS without assuming that the importance weights are bounded. We demonstrate the relevance of our approach and its favorable performance on a set of learning tasks. Since our bound holds for standard IPS, we are able to provide insight into when regularizing IPS is useful. Namely, we identify cases where regularization might not be needed. This goes against the belief that, in practice, clipped IPS often performs better than standard IPS in OPL.
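For readers unfamiliar with the estimators named above, the following is a minimal sketch of standard IPS and clipped IPS on a toy log. It is not the smooth regularization studied in this work, whose form is not given in the abstract. The function name `ips_risk`, its parameters, the clipping threshold, and the toy data are all hypothetical and chosen only for illustration.

```python
def ips_risk(costs, target_probs, logging_probs, clip=None):
    """IPS estimate of the risk of a target policy from logged bandit data.

    costs[i]         : cost observed for the logged action in round i
    target_probs[i]  : probability the target policy gives that action
    logging_probs[i] : probability the logging policy gave that action
    clip             : optional cap on the importance weights (clipped IPS);
                       None means standard, unclipped IPS
    """
    total = 0.0
    for cost, p_target, p_logging in zip(costs, target_probs, logging_probs):
        weight = p_target / p_logging  # importance weight
        if clip is not None:
            weight = min(weight, clip)  # clipping bounds the weight
        total += weight * cost
    return total / len(costs)


# Toy log (made-up numbers): the third round has a small logging
# probability, so its importance weight is large.
costs = [-1.0, 0.0, -1.0, -1.0]
logging_probs = [0.5, 0.25, 0.05, 0.5]
target_probs = [0.6, 0.1, 0.5, 0.4]

standard = ips_risk(costs, target_probs, logging_probs)           # about -3.0
clipped = ips_risk(costs, target_probs, logging_probs, clip=5.0)  # about -1.75
print(standard, clipped)
```

In this toy log a single large weight dominates the standard estimate, and clipping shrinks its contribution at the price of changing the estimate. This is the kind of trade-off that regularizing IPS is meant to control.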