Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data, without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for this mismatch between learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy value, assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined as a KL-divergence ball around the empirical distribution computed from the logging dataset. However, the KL uncertainty set cannot contain distributions whose support differs from that of the empirical distribution, and it is unaware of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to over-fitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally more computationally expensive than KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite-sample complexity and iteration complexity of our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stroke trial.
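The support limitation of KL uncertainty sets can be illustrated with a minimal sketch. It is not the method proposed here, only a toy comparison under stated assumptions: the sample values and the helper names `kl_divergence` and `wasserstein_1d` are hypothetical, the data are one-dimensional, and both samples have equal size with uniform weights. A slightly shifted sample has infinite KL divergence from the logged empirical distribution, so it lies outside every KL ball, while its Wasserstein-1 distance is small.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as {support point: probability}."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            # p places mass where q has none, so the divergence is infinite.
            return math.inf
        total += px * math.log(px / qx)
    return total


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples with uniform weights."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # In one dimension the optimal coupling matches sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical logged data and a slightly shifted deployment environment.
logged = [0.0, 1.0, 2.0]
shifted = [0.1, 1.1, 2.1]

p_shifted = {x: 1.0 / len(shifted) for x in shifted}
p_logged = {x: 1.0 / len(logged) for x in logged}

kl = kl_divergence(p_shifted, p_logged)  # infinite: the supports are disjoint
w1 = wasserstein_1d(logged, shifted)     # small: every point moved by about 0.1

print(f"KL(shifted || logged) = {kl}")
print(f"W1(logged, shifted)   = {w1:.3f}")
```

In this toy case no finite KL radius admits the shifted distribution, whereas a Wasserstein ball of radius slightly above 0.1 does, which reflects the geometry-awareness argument made above.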