Off-policy evaluation and learning in contextual bandits use logged interaction data to estimate and optimize the value of a target policy. Most existing methods require sufficient action overlap between the logging and target policies, and violations of this condition can bias value and policy-gradient estimates. To address this issue, we propose DOLCE (Decomposing Off-policy evaluation/learning into Lagged and Current Effects), which uses only lagged contexts already stored in bandit logs. DOLCE constructs lag-marginalized importance weights and decomposes the objective into a support-robust lagged correction term and a current, model-based term; the resulting biases cancel whenever the reward-model residual is conditionally mean-zero given the lagged context and action. With multiple candidate lags, DOLCE softly aggregates the lag-specific estimates, and we introduce a moment-based training procedure that promotes the required invariance using only logged, lag-augmented data. We show that DOLCE is unbiased in an idealized setting and, with cross-fitting, yields consistent and asymptotically normal estimates under standard conditions. Our experiments demonstrate that DOLCE achieves substantial improvements in both off-policy evaluation and learning, particularly as the proportion of individuals who violate support increases.
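To make the decomposition concrete, the following is a minimal NumPy sketch written from the abstract alone. Every name here (`pi_target_probs`, `pi_target_lag`, `log_prop_lag`, `f_hat`, the softmax aggregation) is an assumption introduced for exposition, not the paper's actual estimator, API, or moment-based training procedure.

```python
import numpy as np

def dolce_value_estimate(x, x_lag, a, r, pi_target_probs,
                         pi_target_lag, log_prop_lag, f_hat):
    """Single-lag, DOLCE-style value estimate (illustrative sketch).

    x, x_lag        : (n, d) current and lagged contexts from the bandit log
    a               : (n,) logged action indices in {0, ..., K-1}
    r               : (n,) logged rewards
    pi_target_probs : callable, ctx -> (n, K) target-policy action probabilities
    pi_target_lag   : (n,) target-policy probability of the logged action,
                      marginalized to the lagged context (assumed precomputed)
    log_prop_lag    : (n,) logging propensity, marginalized the same way
    f_hat           : callable, (ctx, actions) -> (n,) reward-model predictions
                      (fit on a held-out fold when cross-fitting)
    """
    n = r.shape[0]
    # Lag-marginalized importance weights: defined on (x_lag, a), where
    # overlap is assumed to hold even if it fails on the current context.
    w = pi_target_lag / log_prop_lag
    # Support-robust lagged correction term: reweighted reward-model residual.
    correction = w * (r - f_hat(x_lag, a))
    # Current, model-based term: expected model reward under the target policy.
    pi_cur = pi_target_probs(x)  # (n, K)
    direct = sum(pi_cur[:, k] * f_hat(x, np.full(n, k))
                 for k in range(pi_cur.shape[1]))
    # The two bias contributions cancel when E[r - f_hat | x_lag, a] = 0.
    return float(np.mean(correction + direct))

def aggregate_lags(lag_estimates, lag_scores):
    """Softly aggregate lag-specific estimates. The softmax weighting is a
    placeholder standing in for the paper's moment-based training procedure."""
    w = np.exp(lag_scores - np.max(lag_scores))
    w /= w.sum()
    return float(np.dot(w, np.asarray(lag_estimates)))
```

Under this reading, the correction term inherits overlap from the lagged context, the direct term carries the target policy's dependence on the current context, and `aggregate_lags` combines the estimates produced for several candidate lags.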