Identifying which variables influence a response while controlling false positives is a pervasive problem in statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the marginal empirical correlations between each explanatory variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al. [2022]) and introduce variable selection methods based on penalized regression that achieve false discovery rate (FDR) control. We report empirical results from extensive simulation studies, demonstrating improved performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease and find a significant improvement in power.