Identifying the dependent components in multiple data sets is a fundamental problem in many practical applications. The challenge is that the data sets are often high-dimensional, offer only a few observations, and contain latent components with unknown probability distributions. A novel mathematical formulation of this problem is proposed, which enables inference of the underlying correlation structure with strict false positive control. In particular, the false discovery rate is controlled at a pre-defined threshold on two levels simultaneously. The test statistics are derived from the sample coherence matrix, and the required probability models are learned from the data using the bootstrap. Local false discovery rates are used to solve the multiple hypothesis testing problem. Compared to existing techniques in the literature, the developed technique does not assume an a priori correlation structure and works well when the number of data sets is large while the number of observations is small. In addition, it can handle distributional uncertainties, heavy-tailed noise, and outliers.
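To make the role of local false discovery rates concrete, the following is a minimal sketch of one standard lfdr-based decision rule: sort the hypotheses by their lfdr values and reject the largest set whose average lfdr stays at or below the nominal level. The abstract does not state which rule the method uses, so this is an illustrative assumption rather than the authors' procedure. The function name `lfdr_reject` and the example values are hypothetical, the lfdr values are assumed to have been estimated already (for instance from bootstrap-learned models), and the two-level control mentioned above is not shown.

```python
def lfdr_reject(lfdr, alpha):
    """Return a list of booleans marking which null hypotheses are rejected.

    Assumed rule (not taken from the paper): reject the hypotheses with the
    smallest lfdr values for as long as their running mean is <= alpha.
    """
    # Indices of the hypotheses, ordered from smallest to largest lfdr.
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])
    reject = [False] * len(lfdr)
    running_sum = 0.0
    for k, i in enumerate(order, start=1):
        running_sum += lfdr[i]
        # The running mean of ascending values never decreases, so the
        # first violation marks the end of the rejection set.
        if running_sum / k > alpha:
            break
        reject[i] = True
    return reject


# Hypothetical lfdr values for five tested correlations.
example_lfdr = [0.01, 0.9, 0.05, 0.3, 0.02]
decisions = lfdr_reject(example_lfdr, alpha=0.1)
```

The average lfdr of the rejected set serves as an estimate of the false discovery rate, which is why bounding it by `alpha` is a common way to turn per-hypothesis lfdr values into a set of discoveries.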