A central challenge in exploratory association studies based on observational data is that signals may be weak and features often have complex correlation structures. Procedures that control the false discovery rate (FDR) provide statistical guarantees that support the replicability of risk factors identified in exploratory research. In the recently established National COVID Cohort Collaborative (N3C), electronic health record (EHR) data on the same set of candidate features are collected independently at multiple sites, offering an opportunity to identify signals by combining information from different sources. This paper presents a general knockoff-based variable selection algorithm that identifies mutual signals from unions of group-level conditional independence tests, with exact FDR control guarantees in finite samples. The algorithm works in general regression settings and allows both the predictors and the outcomes to be heterogeneous across data sources. We demonstrate the performance of the method through extensive numerical studies and an application to the N3C data.
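For orientation, the selection step of a standard single-source knockoff+ filter can be sketched as below. This is not the multi-source algorithm proposed in this paper, only the generic thresholding rule that knockoff-based procedures commonly build on. The function names and the example statistics are hypothetical, and the feature statistics `w` are assumed to have been computed already (large positive values favor the original feature over its knockoff copy).

```python
import math


def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for feature statistics w at target FDR q.

    Assumption: w[j] > 0 suggests feature j is a signal, and the signs of
    null statistics are symmetric, as in the standard knockoff framework.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of w.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        # The "+1" in the numerator is what distinguishes knockoff+ and
        # makes the estimate of the false discovery proportion conservative.
        if (1 + negatives) / max(1, positives) <= q:
            return t
    # No threshold meets the target level: select nothing.
    return math.inf


def knockoff_select(w, q):
    """Return the indices of features whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics for ten candidate features.
example_w = [5.0, 4.0, 3.5, -0.5, 3.0, 2.0, 0.2, -0.1, 1.5, 0.0]
selected = knockoff_select(example_w, q=0.25)
```

With these made-up statistics, the smallest threshold at which the estimated false discovery proportion falls below 0.25 is 1.5, so six features are selected. Combining evidence across several data sources, as the paper proposes, would require a different construction of the statistics and is not shown here.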