Time-to-event (TTE) models are used in medicine and other fields to estimate the probability distribution of the time until a specific event occurs. TTE models offer several advantages over classification with fixed time horizons, including naturally handling censored observations, but they require more parameters and are challenging to train in settings with limited labeled data. Existing approaches, e.g., proportional hazards or accelerated failure time models, employ distributional assumptions to reduce the number of parameters but are vulnerable to model misspecification. In this work, we address these challenges with MOTOR (Many Outcome Time Oriented Representations), a self-supervised model that leverages the temporal structure found in collections of timestamped events in electronic health records (EHR) and health insurance claims. MOTOR uses a TTE pretraining objective that predicts the probability distribution of times when events occur, making it well-suited to transfer learning for medical prediction tasks. Having pretrained on EHR and claims data of up to 55M patient records (9B clinical events), we evaluate performance after finetuning for 19 tasks across two datasets. Task-specific models built using MOTOR improve time-dependent C statistics by 4.6% over the state of the art while greatly improving sample efficiency, achieving comparable performance to existing methods using only 5% of available task data.
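To make concrete what "naturally handling censored observations" means for a TTE model, the following is a minimal sketch of a log-likelihood in which censored records contribute only their survival probability. It assumes a generic piecewise-constant hazard (piecewise exponential) model chosen purely for illustration; the abstract does not specify the form of MOTOR's objective, and the function name, arguments, and bin layout here are all hypothetical.

```python
import math
from typing import Sequence


def piecewise_exponential_loglik(
    time: float,
    event_observed: bool,
    bin_edges: Sequence[float],
    hazards: Sequence[float],
) -> float:
    """Log-likelihood of one observation under a piecewise-constant hazard.

    Illustrative assumption, not the paper's stated objective.
    bin_edges: increasing bin start times, beginning at 0.0; the last bin
               is open-ended.
    hazards:   one positive, constant hazard rate per bin.
    """
    if len(bin_edges) != len(hazards):
        raise ValueError("need exactly one hazard rate per bin")

    cumulative_hazard = 0.0
    hazard_at_time = hazards[0]
    for k, (start, rate) in enumerate(zip(bin_edges, hazards)):
        if time <= start:
            break  # the observation ended before this bin began
        end = bin_edges[k + 1] if k + 1 < len(bin_edges) else math.inf
        # Time spent at risk inside this bin.
        cumulative_hazard += rate * (min(time, end) - start)
        # A time exactly on a bin edge is assigned to the bin on its left.
        hazard_at_time = rate

    log_survival = -cumulative_hazard
    if event_observed:
        # Event seen at `time`: log density = log hazard + log survival.
        return math.log(hazard_at_time) + log_survival
    # Censored at `time`: we only know the event had not happened yet.
    return log_survival
```

A fixed-horizon classifier would have to discard or mislabel a patient whose follow-up ends before the horizon; here the same record still contributes its log survival term, and an observed event adds only the log hazard at the event time.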