Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data, and is often used to identify the top-k most promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or of downstream policy selection, neglecting the risk-return tradeoff in the subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of the policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate our metric in two example scenarios, demonstrating that it effectively distinguishes between low-risk and high-risk estimators and accurately identifies the most efficient estimator. The most efficient estimator is the one that forms the most advantageous policy portfolios, maximizing return while minimizing risk during online deployment, a nuance that existing metrics typically overlook. To facilitate quick, accurate, and consistent evaluation of OPE via SharpeRatio@k, we have also integrated this metric into the open-source software SCOPE-RL. Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking experiments on various estimators and RL tasks, focusing on their risk-return tradeoff. These experiments offer several interesting directions and suggestions for future OPE research.
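The abstract does not state the formula for SharpeRatio@k, so the following is only a minimal sketch under assumptions: it takes the "return" to be the true value of the best policy among the top-k selected by the estimator minus the value of the behavior policy, and the "risk" to be the population standard deviation of the true values of those k policies. The function name `sharpe_ratio_at_k` and its parameters are hypothetical and are not the SCOPE-RL API.

```python
import math
import statistics


def sharpe_ratio_at_k(estimated_values, true_values, behavior_value, k):
    """Illustrative risk-return ratio of the top-k policy portfolio.

    Assumed form (not taken from the abstract):
        (best true value in top-k - behavior policy value) / std of top-k true values
    """
    if not 1 <= k <= len(estimated_values):
        raise ValueError("k must be between 1 and the number of candidate policies")
    if len(estimated_values) != len(true_values):
        raise ValueError("estimated_values and true_values must have equal length")

    # Rank candidate policies by the estimator's predicted value, best first.
    ranking = sorted(
        range(len(estimated_values)),
        key=lambda i: estimated_values[i],
        reverse=True,
    )
    # The "portfolio" is the set of true values of the k policies sent to the A/B test.
    portfolio = [true_values[i] for i in ranking[:k]]

    excess_return = max(portfolio) - behavior_value  # gain over the baseline
    risk = statistics.pstdev(portfolio)              # spread of deployed policies
    if risk == 0.0:
        return math.nan  # ratio is undefined for a zero-variance portfolio
    return excess_return / risk


# Toy data (made up for illustration): four candidate policies.
true_values = [1.0, 2.0, 3.0, 4.0]
behavior_value = 1.5

# Estimator A ranks the policies correctly; estimator B also finds the best
# policy but places the worst one in its top-2.
estimator_a = [0.9, 2.1, 2.8, 4.2]
estimator_b = [5.0, 1.0, 2.0, 4.0]

ratio_a = sharpe_ratio_at_k(estimator_a, true_values, behavior_value, k=2)
ratio_b = sharpe_ratio_at_k(estimator_b, true_values, behavior_value, k=2)
print(ratio_a, ratio_b)
```

In this toy case both estimators include the best policy in their top-2, so a metric that looks only at the best selected policy would not separate them. Under the assumed form, estimator B scores lower because its portfolio also contains the worst policy, which is the kind of difference in deployment risk the abstract describes.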