We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.
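To make "an estimate of the policy's expected reward" concrete, the following minimal sketch computes an inverse-propensity-scored (IPS) estimate from logged bandit data. The abstract does not say which estimator the algorithm uses, so IPS, the softmax policy form, the function names (`softmax`, `ips_value`), and the toy log are all assumptions made for illustration, not the paper's method.

```python
import math


def softmax(scores):
    """Turn per-action scores into action probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ips_value(logged, policy_scores):
    """IPS estimate of a policy's expected reward.

    logged: list of (context, action, reward, propensity) tuples, where
        propensity is the logging policy's probability of the logged action.
    policy_scores: function mapping a context to a list of per-action scores
        (assumed softmax policy; hypothetical interface).
    """
    total = 0.0
    for context, action, reward, propensity in logged:
        probs = softmax(policy_scores(context))
        # Reweight the observed reward by the target/logging probability ratio.
        total += reward * probs[action] / propensity
    return total / len(logged)


# Hypothetical toy log: two contexts, two actions, uniform logging policy.
# Reward is 1 when the action index matches the context, else 0.
logged = [
    (0, 0, 1.0, 0.5),
    (0, 1, 0.0, 0.5),
    (1, 1, 1.0, 0.5),
    (1, 0, 0.0, 0.5),
]


def uniform_scores(context):
    # Equal scores give a uniform policy over the two actions.
    return [0.0, 0.0]


def matching_scores(context):
    # Puts more score on the action whose index equals the context.
    return [2.0, 0.0] if context == 0 else [0.0, 2.0]


uniform_estimate = ips_value(logged, uniform_scores)
matching_estimate = ips_value(logged, matching_scores)
```

In this toy log the policy that favors the rewarded action gets a higher estimate than the uniform one. A boosting method of the kind described would, under these assumptions, add base learners to the score function so as to improve such an estimate round by round.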