Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. This joint-modeling strategy can be unnecessarily conservative when more information about the source of the distributional shift is available. This paper studies a more nuanced problem -- robust policy learning under concept drift, where only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly robust estimator for evaluating the worst-case average reward of a given policy over a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even when the nuisance parameters are estimated at a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound shows the optimality of this rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement over existing benchmarks.
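As a rough illustration of the worst-case evaluation step, the sketch below estimates the robust value of a fixed policy when only the conditional reward distribution is perturbed, using a KL-divergence ball of radius `delta` as one illustrative choice of uncertainty set (the paper does not fix this choice here). The policy `pi`, the fitted conditional reward sampler `mu_samples`, and the radius `delta` are hypothetical inputs, and the doubly robust correction described in the abstract is replaced by a simple plug-in for brevity; this is a minimal sketch under those assumptions, not the paper's estimator.

```python
# Minimal sketch: plug-in worst-case policy value under concept drift,
# with a KL ball on the conditional reward distribution (illustrative choice).
import numpy as np
from scipy.optimize import minimize_scalar

def kl_dual_worst_case(rewards, delta):
    """Dual form of inf_{KL(Q||P) <= delta} E_Q[R] on sampled rewards:
       sup_{alpha > 0} -alpha * log E_P[exp(-R / alpha)] - alpha * delta."""
    rewards = np.asarray(rewards, dtype=float)

    def neg_dual(alpha):
        # Numerically stable log-mean-exp of -R / alpha.
        z = -rewards / alpha
        lme = np.log(np.mean(np.exp(z - z.max()))) + z.max()
        return -(-alpha * lme - alpha * delta)

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

def plug_in_robust_value(X, pi, mu_samples, delta):
    """Average, over covariates X, of the per-covariate worst-case conditional
       reward of the action chosen by `pi`; `mu_samples(x, a)` returns draws
       from a fitted model of the nominal conditional reward distribution."""
    values = []
    for x in X:
        a = pi(x)                              # action selected by the policy at x
        draws = mu_samples(x, a)               # simulated rewards under the nominal conditional law
        values.append(kl_dual_worst_case(draws, delta))
    return float(np.mean(values))
```

In this simplification, the outer average plays the role of the unperturbed covariate distribution, while the inner dual problem captures the adversarial shift in the conditional reward law; the doubly robust version in the paper additionally corrects for errors in the fitted conditional model using the observed data.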