Machine learning (ML) holds great potential for accurately forecasting treatment outcomes over time, which could ultimately enable the adoption of more individualized treatment strategies in many practical applications. However, a significant challenge that has been largely overlooked by the ML literature on this topic is the presence of informative sampling in observational data. When instances are observed irregularly over time, sampling times are typically not random, but rather informative -- depending on the instance's characteristics, past outcomes, and administered treatments. In this work, we formalize informative sampling as a covariate shift problem and show that it can prohibit accurate estimation of treatment outcomes if not properly accounted for. To overcome this challenge, we present a general framework for learning treatment outcomes in the presence of informative sampling using inverse intensity-weighting, and propose a novel method, TESAR-CDE, that instantiates this framework using Neural CDEs. Using a simulation environment based on a clinical use case, we demonstrate the effectiveness of our approach in learning under informative sampling.
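To make the inverse intensity-weighting idea from the abstract concrete, the following is a minimal toy sketch, not the TESAR-CDE method itself. It assumes a hypothetical population whose outcome `y` drives its own chance of being observed, and it treats the sampling intensity as known (in practice it would have to be estimated from data, and the weights would be applied to a model's training loss rather than to a simple mean). All names and numbers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: one outcome per instance, y ~ Uniform(0, 1).
y = rng.uniform(0.0, 1.0, size=100_000)

# Assumed sampling intensity: instances with higher outcomes are observed
# more often. It is treated as known here purely for illustration.
intensity = 0.1 + 0.8 * y
observed = rng.uniform(size=y.size) < intensity

y_obs = y[observed]
lam_obs = intensity[observed]

# Naive estimate: average over whatever happened to be observed.
naive_mean = y_obs.mean()

# Inverse intensity-weighted estimate: each observation is weighted by
# 1 / intensity, which undoes the over-representation of frequently
# sampled instances (self-normalised form).
weights = 1.0 / lam_obs
iiw_mean = np.sum(weights * y_obs) / np.sum(weights)

print(f"population mean: {y.mean():.3f}")
print(f"naive mean:      {naive_mean:.3f}")
print(f"weighted mean:   {iiw_mean:.3f}")
```

In this toy setting the naive average is pulled toward high outcomes because they are sampled more often, while the weighted average stays close to the population mean. The same reweighting logic is what lets a weighted training loss correct for the covariate shift that informative sampling induces.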