We study estimator selection and hyper-parameter tuning in off-policy evaluation. Although cross-validation is the most popular approach to model selection in supervised learning, off-policy evaluation relies mostly on theory, which provides only limited guidance to practitioners. We show how to apply cross-validation to off-policy evaluation, challenging the popular belief that cross-validation is not feasible in this setting. We evaluate our method empirically and show that it addresses a variety of use cases.