Typically, electronic health record data are not collected to address a specific research question. Instead, they comprise numerous individuals recruited at different ages, for whom medical, environmental and often also genetic data are collected. Some phenotypes, such as disease-onset age, may be reported retrospectively if the event preceded recruitment; such observations are termed ``prevalent''. The standard method for accommodating this ``delayed entry'' conditions on the entire history up to recruitment, so the retrospective prevalent failure times are conditioned upon and cannot contribute to estimating the disease-onset age distribution. An alternative approach conditions only on survival up to the recruitment age, together with the recruitment age itself. This approach allows the prevalent information to be incorporated, but it brings about numerical and computational difficulties. In this work we develop consistent estimators of the coefficients in a regression model for the age at onset that utilize the prevalent data. Asymptotic results are provided, and simulations showcase the substantial efficiency gain that the proposed approach may achieve. In particular, the method is highly useful for leveraging large-scale repositories in replicability analysis of genetic variants. Indeed, an analysis of urinary bladder cancer data reveals that the proposed approach yields about twice as many replicated discoveries as the popular approach.