Algorithms that ensure reproducible findings from large-scale, high-dimensional data are pivotal in numerous signal processing applications. In recent years, multivariate false discovery rate (FDR) controlling methods have emerged, providing guarantees even in high-dimensional settings where the number of variables surpasses the number of samples. However, these methods often fail to reliably control the FDR in the presence of groups of highly dependent variables, a common characteristic in fields such as genomics and finance. To tackle this critical issue, we introduce a novel framework that accounts for general dependency structures. Our proposed dependency-aware T-Rex selector integrates hierarchical graphical models within the T-Rex framework to effectively harness the dependency structure among variables. Leveraging martingale theory, we prove that our variable penalization mechanism ensures FDR control. We further generalize the FDR-controlling framework by stating and proving a clear condition necessary for designing both graphical and non-graphical models that capture dependencies. Additionally, we formulate a fully integrated optimal calibration algorithm that concurrently determines the parameters of the graphical model and the T-Rex framework, so that the FDR is controlled while the number of selected variables is maximized. Numerical experiments and a breast cancer survival analysis use case demonstrate that the proposed method is the only one among the state-of-the-art benchmark methods that controls the FDR and reliably detects genes that have been previously identified to be related to breast cancer. An open-source implementation is available within the R package TRexSelector on CRAN.