We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}\in\mathbb{R}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media and in computational biology, among other areas. We formulate it as a hypothesis testing problem: under the null hypothesis, the two databases are statistically independent. Under the alternative, there exists an unknown permutation $\sigma$ over the set of $n$ users (that is, a row permutation) such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2d\to0$ as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespective of the value of $n$. This complements the performance of a simple test that thresholds the sum of all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2d\to0$. These results close significant gaps in recent related studies.
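To make the testing setup concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes standard normal entries, that each entry of $\mathsf{X}$ is $\rho$-correlated with the matching entry of the aligned $\mathsf{Y}$, and that the "sum" statistic is the sum of all pairwise inner products between users of the two databases, which does not depend on $\sigma$. The abstract's $\mathsf{X}^T\mathsf{Y}$ notation may follow a different indexing convention. The function names and the midpoint threshold are hypothetical choices made for illustration.

```python
import numpy as np


def sample_databases(n, d, rho, correlated, rng):
    """Draw (X, Y) under the null (independent) or the alternative (assumed model)."""
    X = rng.standard_normal((n, d))
    if not correlated:
        # Null hypothesis: Y is drawn independently of X.
        return X, rng.standard_normal((n, d))
    # Alternative: each entry of the aligned Y is rho-correlated with X.
    Z = rng.standard_normal((n, d))
    Y_aligned = rho * X + np.sqrt(1.0 - rho**2) * Z
    # Hide the alignment by shuffling the rows (users) of Y.
    sigma = rng.permutation(n)
    return X, Y_aligned[sigma]


def sum_statistic(X, Y):
    """Sum of <X_i, Y_j> over all user pairs (i, j); invariant to row permutations."""
    return float(X.sum(axis=0) @ Y.sum(axis=0))


def sum_test(X, Y, rho):
    """Declare 'correlated' when the statistic exceeds a midpoint threshold.

    Hypothetical threshold: halfway between the null mean (0) and the
    alternative mean (rho * n * d) under the model assumed above.
    """
    n, d = X.shape
    threshold = 0.5 * rho * n * d
    return sum_statistic(X, Y) > threshold
```

Under these assumptions the statistic has mean $0$ under the null and $\rho n d$ under the alternative, with null standard deviation $n\sqrt{d}$, so the separation in standard deviations is $\rho\sqrt{d}$. This is consistent with $\rho^2 d$ being the quantity that governs detectability, and with $n$ playing no role in this test. For example, calling `sample_databases(50, 900, 0.5, True, np.random.default_rng(0))` and passing the result to `sum_test` exercises a regime where $\rho^2 d$ is large.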