Identifying the dependent components in multiple data sets is a fundamental problem in many practical applications. The challenge in these applications is that the data sets are often high-dimensional, offer only a few observations, and contain latent components with unknown probability distributions. A novel mathematical formulation of this problem is proposed, which enables inference of the underlying correlation structure with strict false positive control. In particular, the false discovery rate is controlled at a pre-defined threshold on two levels simultaneously. The test statistics are derived from the sample coherence matrix. The required probability models are learned from the data using the bootstrap. Local false discovery rates are used to solve the multiple hypothesis testing problem. Unlike existing techniques in the literature, the developed technique does not assume an a priori correlation structure and works well when the number of data sets is large while the number of observations is small. In addition, it can handle distributional uncertainties, heavy-tailed noise, and outliers.
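To illustrate how local false discovery rates can drive a multiple testing decision, the following is a minimal sketch of a generic lfdr-based rejection rule: reject the hypotheses with the smallest lfdr values for as long as the average lfdr of the rejected set stays at or below the nominal level. This is a commonly used rule and not necessarily the exact two-level procedure of this work. The function name `lfdr_reject`, the parameter `alpha`, and the example lfdr values are assumptions made for illustration; in practice the lfdr values would be estimated from the data, for example from bootstrap-based null models.

```python
import numpy as np


def lfdr_reject(lfdr, alpha):
    """Return a boolean mask marking the rejected hypotheses.

    Hypothetical helper: rejects the hypotheses with the smallest local
    false discovery rates, taking as many as possible while the mean lfdr
    of the rejected set stays at or below alpha.
    """
    lfdr = np.asarray(lfdr, dtype=float)
    order = np.argsort(lfdr)  # indices from smallest to largest lfdr
    # Mean lfdr of the k smallest values, for k = 1, ..., n.
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    # The running mean of ascending values never decreases, so the number
    # of entries <= alpha is the size of the largest admissible set.
    k = int(np.sum(running_mean <= alpha))
    reject = np.zeros(lfdr.size, dtype=bool)
    reject[order[:k]] = True
    return reject


# Made-up lfdr values for five hypotheses (illustration only).
example_lfdr = [0.01, 0.90, 0.04, 0.30, 0.02]
mask = lfdr_reject(example_lfdr, alpha=0.05)
print(mask)
```

With these example values and `alpha=0.05`, the three hypotheses with lfdr 0.01, 0.02, and 0.04 are rejected, since adding the next one (0.30) would raise the average lfdr of the rejected set to 0.0925.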