This paper studies policy evaluation with multiple data sources, focusing on settings in which a two-arm experimental dataset is complemented by a historical dataset generated under the control arm alone. We propose novel data integration methods that linearly combine base policy value estimators constructed from the experimental and historical data, with weights chosen to minimize the mean squared error (MSE) of the combined estimator. We further apply the pessimistic principle to obtain more robust estimators, and extend these developments to sequential decision making. Theoretically, we establish non-asymptotic error bounds for the MSEs of the proposed estimators, and derive their oracle, efficiency, and robustness properties across a broad spectrum of reward shift scenarios. Numerical experiments and analyses based on real data from a ridesharing company demonstrate the superior performance of the proposed estimators.
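To make the idea of MSE-optimal linear integration concrete, the following minimal sketch combines two control-arm mean estimates, one from the experimental data and one from the historical data. It is an illustration under simplifying assumptions, not the estimator proposed in the paper: the two estimates are assumed independent, the reward shift is summarized by a single scalar bias, and that bias is replaced by a naive plug-in difference of sample means. The function names `mse_optimal_weight` and `combined_ate` are hypothetical. Under these assumptions, the combined estimate $(1-w)\,\hat\mu_e + w\,\hat\mu_h$ has MSE $(1-w)^2 v_e + w^2 (v_h + b^2)$, which is minimized at $w^* = v_e / (v_e + v_h + b^2)$.

```python
import numpy as np


def mse_optimal_weight(var_exp, var_hist, bias_hist):
    """Weight on the historical estimate that minimizes
    (1 - w)^2 * var_exp + w^2 * (var_hist + bias_hist^2),
    assuming the two estimates are independent."""
    return var_exp / (var_exp + var_hist + bias_hist ** 2)


def combined_ate(treat, ctrl_exp, ctrl_hist):
    """Illustrative treatment-effect estimate that blends experimental
    and historical control means (hypothetical helper, not the paper's)."""
    treat = np.asarray(treat, dtype=float)
    ctrl_exp = np.asarray(ctrl_exp, dtype=float)
    ctrl_hist = np.asarray(ctrl_hist, dtype=float)

    # Variances of the two sample means.
    var_exp = ctrl_exp.var(ddof=1) / ctrl_exp.size
    var_hist = ctrl_hist.var(ddof=1) / ctrl_hist.size

    # Assumption: plug-in estimate of the reward shift. This is noisy,
    # which is one reason a more conservative choice may be preferable.
    bias_hat = ctrl_hist.mean() - ctrl_exp.mean()

    w = mse_optimal_weight(var_exp, var_hist, bias_hat)
    ctrl_mean = (1.0 - w) * ctrl_exp.mean() + w * ctrl_hist.mean()
    return treat.mean() - ctrl_mean, w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    treat = rng.normal(1.0, 1.0, size=200)       # experimental treatment arm
    ctrl_exp = rng.normal(0.0, 1.0, size=200)    # experimental control arm
    ctrl_hist = rng.normal(0.1, 1.0, size=2000)  # historical control, small shift
    ate, w = combined_ate(treat, ctrl_exp, ctrl_hist)
    print(f"estimated effect: {ate:.3f}, weight on historical data: {w:.3f}")
```

The weight shrinks toward zero as the estimated shift grows, so the combined estimate falls back on the experimental data alone when the historical data look unreliable, and it borrows more from the historical data when the shift is small relative to the sampling variance.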