Semi-supervised Method for Risk Prediction with Doubly Censored EHR Data

The rapid expansion of large-scale electronic health record (EHR) data offers unique opportunities to improve the accuracy and efficiency of clinical risk estimation. However, because clinical events may occur outside the recording health system, event times are frequently subject to double censoring (both left and right censoring). Moreover, gold-standard event times can often be ascertained only through labor-intensive manual chart review, yielding labels for only a small subset of patients. Relying on this small labeled set alone is statistically inefficient, whereas widely available surrogate outcomes, such as the time to first diagnostic code or first disease mention, are error-prone and can yield biased estimates if used directly. Semi-supervised learning (SSL) methods provide a principled way to integrate labeled and unlabeled data, and prior work has demonstrated their advantages in settings with binary or right-censored outcomes. Existing approaches, however, do not accommodate double censoring in risk prediction, which poses additional methodological challenges. To address this gap, we develop a novel SSL framework for risk prediction that combines a small set of gold-standard labels with large-scale surrogate information under double censoring. We establish the theoretical validity of the proposed estimator. Through extensive simulation studies, we show that our method substantially improves estimation efficiency relative to existing supervised estimators that use only the labeled data. Finally, we demonstrate its practical value by applying it to study risk factors for type 2 diabetes (T2D) using EHR data from a health system in the United States.
