Detecting and recovering a low-rank signal in a noisy data matrix is a fundamental task in data analysis. Typically, this task is addressed by inspecting and manipulating the spectrum of the observed data, for example by thresholding the singular values of the data matrix at a certain critical level. This approach is well established for homoskedastic noise, where the noise variance is identical across the entries. In numerous applications, however, the noise is heteroskedastic, with characteristics that may vary considerably across the rows and columns of the data. In this scenario, the spectral behavior of the noise can differ significantly from the homoskedastic case, which poses various challenges for signal detection and recovery. To address these challenges, we develop an adaptive normalization procedure that equalizes the average noise variance across the rows and columns of a given data matrix. The proposed procedure is data-driven and fully automatic, and it supports a broad range of noise distributions, variance patterns, and signal structures. We establish that in many cases this procedure enforces the standard spectral behavior of homoskedastic noise, namely the Marchenko-Pastur (MP) law, which allows for simple and reliable detection of signal components. Furthermore, we demonstrate that our approach can substantially improve signal recovery in heteroskedastic settings by manipulating the spectrum after normalization. Lastly, we apply our method to single-cell RNA sequencing and spatial transcriptomics data, showcasing accurate fits to the MP law after normalization. Our approach relies on recent results in random matrix theory that describe the resolvent of the noise via the so-called Dyson equation. By leveraging this relation, we can accurately infer the noise level in each row and each column directly from the resolvent of the data.
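To illustrate why equalizing the noise variance matters for spectral thresholding, the following minimal sketch compares the largest eigenvalue of a pure-noise matrix with the MP upper edge, before and after normalization. It is not the data-driven procedure described above: it assumes a hypothetical rank-one variance pattern `Var(E_ij) = x[i] * y[j]` and uses the true (oracle) factors `x` and `y`, which would be unknown in practice. The helper names `mp_upper_edge` and `top_eigenvalue` are introduced here for illustration only.

```python
import numpy as np


def mp_upper_edge(m, n, sigma2=1.0):
    """Upper edge of the MP law for the eigenvalues of Y @ Y.T / n,
    where Y is m x n with i.i.d. entries of variance sigma2."""
    gamma = m / n
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2


def top_eigenvalue(Y):
    """Largest eigenvalue of Y @ Y.T / n, computed from the singular values."""
    n = Y.shape[1]
    s = np.linalg.svd(Y, compute_uv=False)
    return s[0] ** 2 / n


rng = np.random.default_rng(0)
m, n = 300, 600

# Assumed (hypothetical) heteroskedastic variance pattern: Var(E_ij) = x_i * y_j.
x = rng.uniform(0.1, 4.0, size=m)
y = rng.uniform(0.1, 4.0, size=n)
x = x / np.mean(np.outer(x, y))  # rescale so the average variance is 1
std = np.sqrt(np.outer(x, y))

# Pure heteroskedastic noise, no signal.
E = rng.standard_normal((m, n)) * std

edge = mp_upper_edge(m, n)
raw_top = top_eigenvalue(E)

# Oracle normalization: divide out the true row and column scales.
E_norm = E / std
norm_top = top_eigenvalue(E_norm)

print(f"MP upper edge:               {edge:.3f}")
print(f"top eigenvalue, raw noise:   {raw_top:.3f}")
print(f"top eigenvalue, normalized:  {norm_top:.3f}")
```

In this setup the raw noise has unit average variance, yet its largest eigenvalue lies above the MP edge, so thresholding at that edge would report a spurious signal component. After normalization the largest eigenvalue sits close to the edge, which is the behavior that makes simple threshold-based detection reliable.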