In this paper, we study the problems of detecting and recovering hidden submatrices with elevated means inside a large Gaussian random matrix. We consider two different structures for the planted submatrices. In the first model, the planted submatrices are disjoint, and their row and column indices can be arbitrary. Inspired by scientific applications, the second model restricts the row and column indices to be consecutive. In the detection problem, under the null hypothesis, the observed matrix consists of independent and identically distributed standard normal entries. Under the alternative, there exists a set of hidden submatrices with elevated means inside the same standard normal matrix. Recovery refers to the task of locating the hidden submatrices. For both problems, and for both models, we characterize the statistical and computational barriers by deriving information-theoretic lower bounds, designing and analyzing algorithms that match those bounds, and proving computational lower bounds based on the low-degree polynomial conjecture. In particular, we show that the space of model parameters (i.e., the number of planted submatrices, their dimensions, and the elevated mean) can be partitioned into three regions: the impossible regime, where all algorithms fail; the hard regime, where detection or recovery is statistically possible but we give evidence that no polynomial-time algorithm exists; and the easy regime, where polynomial-time algorithms exist.
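To make the two planted structures concrete, the following is a minimal sketch of how one might sample from the null and alternative models. It is not the construction or the algorithms analyzed in the paper. The symbols `n` (matrix size), `k` (block side length), `m` (number of blocks), and `mu` (elevated mean) are names assumed here for illustration, as are the choice of square blocks and the simplification that blocks share no rows or columns. The normalized sum statistic at the end is one simple illustrative detection statistic, included only to show how the elevated mean shifts the data.

```python
import random

def sample_noise(n, rng):
    """Null model: an n x n matrix of i.i.d. standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

def arbitrary_supports(n, k, m, rng):
    """First structure: m blocks with arbitrary row and column index sets.
    Assumption: blocks share no rows and no columns (requires m * k <= n)."""
    rows = rng.sample(range(n), m * k)
    cols = rng.sample(range(n), m * k)
    return [(rows[i * k:(i + 1) * k], cols[i * k:(i + 1) * k]) for i in range(m)]

def consecutive_supports(n, k, m, rng):
    """Second structure: m blocks whose row and column indices are consecutive."""
    def starts():
        # Sorted random gaps plus an offset of i * k give non-overlapping intervals.
        gaps = sorted(rng.randrange(n - m * k + 1) for _ in range(m))
        return [g + i * k for i, g in enumerate(gaps)]
    row_starts, col_starts = starts(), starts()
    rng.shuffle(col_starts)  # pair row intervals with column intervals at random
    return [(list(range(r, r + k)), list(range(c, c + k)))
            for r, c in zip(row_starts, col_starts)]

def plant(noise, supports, mu):
    """Alternative model: add the elevated mean mu on every planted block."""
    x = [row[:] for row in noise]
    for rows, cols in supports:
        for i in rows:
            for j in cols:
                x[i][j] += mu
    return x

def sum_statistic(x):
    """Sum of all entries divided by n; distributed as N(0, 1) under the null."""
    return sum(map(sum, x)) / len(x)
```

Under these assumptions, planting `m` blocks of size `k` by `k` raises the expected value of `sum_statistic` from 0 to `m * k * k * mu / n`, which gives a rough sense of when a global sum can separate the two hypotheses.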