Counterfactual Risk Minimization (CRM) is a framework for the logged bandit feedback problem, where the goal is to improve a logging policy using offline data. In this paper, we explore the case where learned policies can be deployed multiple times to acquire new data. We extend the CRM principle and its theory to this scenario, which we call "Sequential Counterfactual Risk Minimization (SCRM)." We introduce a novel counterfactual estimator and identify conditions under which the performance of CRM can be improved in terms of excess risk and regret rates, using an analysis similar to that of restart strategies in accelerated optimization methods. We also provide an empirical evaluation of our method in both discrete and continuous action settings, and demonstrate the benefits of multiple deployments of CRM.
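To make the setting concrete, the following is a minimal sketch of the deploy, log, and re-learn loop described above, on a context-free toy bandit. It is an illustration under stated assumptions, not the paper's method: it uses a plain (optionally clipped) inverse propensity scoring (IPS) estimator rather than the novel estimator introduced in the paper, and the names `ips_risk`, `softmax`, and `sequential_rounds`, as well as every default parameter, are hypothetical choices made for this example.

```python
import numpy as np


def ips_risk(losses, target_probs, logging_probs, clip=None):
    """IPS estimate of a target policy's expected loss from logged data.

    Assumption: a standard (optionally clipped) IPS estimator, used here
    only as a stand-in; it is not the estimator proposed in the paper.
    """
    weights = target_probs / logging_probs
    if clip is not None:
        weights = np.minimum(weights, clip)
    return float(np.mean(losses * weights))


def softmax(theta):
    """Numerically stable softmax over action scores."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def sequential_rounds(true_losses, n_rounds=3, n_samples=2000,
                      lr=1.0, n_steps=20, seed=0):
    """Toy sequential loop: deploy a policy, log data, re-learn, repeat.

    `true_losses[a]` is the (hypothetical) probability that action `a`
    incurs a loss of 1. Returns the final policy and the true expected
    loss of the policy learned after each round.
    """
    true_losses = np.asarray(true_losses, dtype=float)
    rng = np.random.default_rng(seed)
    k = len(true_losses)
    theta = np.zeros(k)  # start from a uniform logging policy
    history = []
    for _ in range(n_rounds):
        logging = softmax(theta)
        # Deploy the current policy and log (action, loss, propensity).
        actions = rng.choice(k, size=n_samples, p=logging)
        losses = rng.binomial(1, true_losses[actions]).astype(float)
        # The IPS risk of softmax(theta) is linear in the policy:
        # sum_a pi[a] * g[a], with g[a] the propensity-weighted loss of a.
        g = np.bincount(actions, weights=losses / logging[actions],
                        minlength=k) / n_samples
        # Minimize the IPS risk with a few gradient steps on theta.
        for _ in range(n_steps):
            pi = softmax(theta)
            theta = theta - lr * pi * (g - pi @ g)
        history.append(float(softmax(theta) @ true_losses))
    return softmax(theta), history
```

Calling `sequential_rounds([0.8, 0.5, 0.2])` runs three deployments; each round's logged data comes from the policy learned in the previous round, which is the only feature of the sequential setting this sketch is meant to show. The number of gradient steps is kept small on purpose so that the deployed policy keeps some probability on every action, since plain IPS becomes unreliable for actions the logging policy almost never plays.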