In this paper, we study the detection and recovery of hidden submatrices with elevated means inside a large Gaussian random matrix. We consider two different structures for the planted submatrices. In the first model, the planted submatrices are disjoint, and their row and column indices can be arbitrary. Inspired by scientific applications, the second model restricts the row and column indices to be consecutive. In the detection problem, under the null hypothesis, the observed matrix has independent and identically distributed standard normal entries. Under the alternative, a set of hidden submatrices with elevated means is planted inside the same standard normal matrix. Recovery refers to the task of locating the hidden submatrices. For both problems, and for both models, we characterize the statistical and computational barriers by deriving information-theoretic lower bounds, designing and analyzing algorithms that match those bounds, and proving computational lower bounds based on the low-degree polynomial conjecture. In particular, we show that the space of model parameters (i.e., the number of planted submatrices, their dimensions, and the elevated mean) can be partitioned into three regions: the impossible regime, where all algorithms fail; the hard regime, where detection or recovery is statistically possible but we give evidence that no polynomial-time algorithm exists; and the easy regime, where polynomial-time algorithms exist.
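To make the detection setup concrete, the following is a minimal sketch of the first model (arbitrary, non-consecutive indices) together with a simple global sum statistic. It is an illustration under stated assumptions, not the algorithms analyzed in the paper: the function names `sample_matrix` and `sum_statistic`, the parameter names `n`, `k`, `m`, `lam`, the square shape of the matrix and of the planted blocks, and the reading of "disjoint" as disjoint row sets and disjoint column sets are all hypothetical choices made here for illustration.

```python
import random


def sample_matrix(n, k, m, lam, rng, planted=True):
    """Draw an n x n matrix of i.i.d. N(0, 1) entries.

    If `planted` is True (the alternative), add `lam` to m planted
    k x k submatrices whose row and column index sets are arbitrary.
    Assumption for this sketch: the blocks use disjoint row sets and
    disjoint column sets, which requires m * k <= n.
    """
    X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    blocks = []
    if planted:
        # Pick m*k distinct rows and columns, then cut them into chunks of k.
        rows = rng.sample(range(n), m * k)
        cols = rng.sample(range(n), m * k)
        for b in range(m):
            r = rows[b * k:(b + 1) * k]
            c = cols[b * k:(b + 1) * k]
            for i in r:
                for j in c:
                    X[i][j] += lam
            blocks.append((sorted(r), sorted(c)))
    return X, blocks


def sum_statistic(X):
    """Total sum of an n x n matrix divided by n.

    Under the null this is N(0, 1); under the alternative sketched above
    its mean shifts by m * k**2 * lam / n.
    """
    n = len(X)
    return sum(map(sum, X)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    X_null, _ = sample_matrix(60, 10, 2, 1.0, rng, planted=False)
    X_alt, blocks = sample_matrix(60, 10, 2, 1.0, rng, planted=True)
    print("null statistic:", sum_statistic(X_null))
    print("alternative statistic:", sum_statistic(X_alt))
```

A detector built on this statistic would reject the null when the value exceeds a chosen threshold. It succeeds only when the shift `m * k**2 * lam / n` is large relative to the unit noise level, so it illustrates the detection problem itself rather than the boundaries between the three regimes.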