We consider policy optimization in contextual bandits, where the learner is given a fixed dataset of logged interactions. While pessimistic regularizers are typically used to mitigate distribution shift, prior implementations of them are not computationally efficient. We present the first oracle-efficient algorithm for pessimistic policy optimization: it reduces to supervised learning, which makes it broadly applicable. We also obtain best-effort statistical guarantees analogous to those for pessimistic approaches in prior work. We instantiate our approach for both discrete and continuous actions, and perform extensive experiments in both settings, showing an advantage over unregularized policy optimization across a wide range of configurations.