Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, on the assumption that better estimators inherently yield better policies. Although theoretically justified, this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as the action space grows. We show that estimator-aware policy parametrization can mitigate, but not fully resolve, these optimization challenges. Building on this, we explore simpler weighted log-likelihood objectives and demonstrate that they enjoy substantially better optimization properties while still recovering competitive, and often superior, policies. Our findings underscore the need to address optimization explicitly when developing OPL algorithms for large action spaces.
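To make the contrast between the two families of objectives concrete, the following is a minimal sketch, not the paper's implementation. It assumes a linear-softmax policy, an inverse propensity scoring (IPS) estimator as the representative OPE objective, and one common form of weighted log-likelihood in which each logged action's log-probability is weighted by reward over logging propensity. The abstract does not specify the exact objectives, so these forms, the function names, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def logged_action_log_probs(theta, X, a):
    """log pi_theta(a_i | x_i) for a linear-softmax policy (illustrative choice)."""
    return log_softmax(X @ theta)[np.arange(len(a)), a]

def ips_objective(theta, X, a, r, p0):
    """IPS value estimate: mean of r_i * pi_theta(a_i|x_i) / pi_0(a_i|x_i)."""
    return np.mean(r * np.exp(logged_action_log_probs(theta, X, a)) / p0)

def weighted_loglik_objective(theta, X, a, r, p0):
    """Weighted log-likelihood: mean of (r_i / pi_0(a_i|x_i)) * log pi_theta(a_i|x_i)."""
    return np.mean(r / p0 * logged_action_log_probs(theta, X, a))

# Hypothetical synthetic logged data: uniform logging policy, binary rewards.
rng = np.random.default_rng(0)
n, d, K = 500, 5, 50  # samples, context dimension, number of actions
X = rng.normal(size=(n, d))
a = rng.integers(0, K, size=n)
r = rng.binomial(1, 0.3, size=n).astype(float)
p0 = np.full(n, 1.0 / K)

# Two arbitrary parameter settings and their midpoint.
theta_a = rng.normal(size=(d, K))
theta_b = rng.normal(size=(d, K))
theta_mid = 0.5 * (theta_a + theta_b)

# Midpoint concavity check along one segment: f(mid) - average of endpoints.
wll_gap = weighted_loglik_objective(theta_mid, X, a, r, p0) - 0.5 * (
    weighted_loglik_objective(theta_a, X, a, r, p0)
    + weighted_loglik_objective(theta_b, X, a, r, p0)
)
ips_gap = ips_objective(theta_mid, X, a, r, p0) - 0.5 * (
    ips_objective(theta_a, X, a, r, p0) + ips_objective(theta_b, X, a, r, p0)
)
print("weighted log-likelihood midpoint gap:", wll_gap)
print("IPS midpoint gap:", ips_gap)
```

With nonnegative rewards, the weighted log-likelihood above is a nonnegative combination of log-softmax terms and is therefore concave in the parameters of a linear-softmax policy, so its midpoint gap is never negative. The IPS objective carries no such guarantee in general. This is a standard property of log-softmax and is offered only as intuition for why the two objectives can differ in optimization behavior, not as a restatement of the paper's results.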