Predicting individualized potential outcomes in sequential decision-making is central to optimizing therapeutic decisions in personalized medicine (e.g., which dosing sequence to give a cancer patient). However, predicting potential outcomes over long horizons is notoriously difficult. Existing methods that break the curse of the horizon typically lack strong theoretical guarantees such as orthogonality and quasi-oracle efficiency. In this paper, we revisit the problem of predicting individualized potential outcomes in sequential decision-making (i.e., estimating Q-functions in Markov decision processes from observational data) through a causal inference lens. In particular, we develop a comprehensive theoretical foundation for meta-learners in this setting, with a focus on beneficial theoretical properties. As a result, we obtain a novel meta-learner, called the DRQ-learner, and establish that it is: (1) doubly robust (i.e., it yields valid inference when one of the nuisance functions is misspecified), (2) Neyman-orthogonal (i.e., insensitive to first-order estimation errors in the nuisance functions), and (3) quasi-oracle efficient (i.e., it behaves asymptotically as if the ground-truth nuisance functions were known). Our DRQ-learner is applicable to settings with both discrete and continuous state spaces. Furthermore, our DRQ-learner is flexible and can be combined with arbitrary machine learning models (e.g., neural networks). We validate our theoretical results through numerical experiments, showing that our meta-learner outperforms state-of-the-art baselines.