We consider the problem of parameter estimation in a high-dimensional generalized linear model. Spectral methods obtained via the principal eigenvector of a suitable data-dependent matrix provide a simple yet surprisingly effective solution. However, despite their wide use, a rigorous performance characterization, as well as a principled way to preprocess the data, are available only for unstructured (i.i.d.\ Gaussian and Haar orthogonal) designs. In contrast, real-world data matrices are highly structured and exhibit non-trivial correlations. To address the problem, we consider correlated Gaussian designs capturing the anisotropic nature of the features via a covariance matrix $\Sigma$. Our main result is a precise asymptotic characterization of the performance of spectral estimators. This allows us to identify the optimal preprocessing that minimizes the number of samples needed for parameter estimation. Surprisingly, such preprocessing is universal across a broad set of designs, which partly addresses a conjecture on optimal spectral estimators for rotationally invariant models. Our principled approach vastly improves upon previous heuristic methods, including for designs common in computational imaging and genetics. The proposed methodology, based on approximate message passing, is broadly applicable and opens the way to the precise characterization of spiked matrices and of the corresponding spectral methods in a variety of settings.
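To make the object of study concrete, the following is a minimal sketch of a spectral estimator under a correlated Gaussian design. It is an illustration under stated assumptions, not the method analyzed in this work: the noiseless quadratic link $y_i = \langle x_i, \beta \rangle^2$, the truncation used as preprocessing, the diagonal choice of $\Sigma$, and all function names (`sample_glm`, `preprocess`, `spectral_estimate`) are hypothetical choices made only for this example. In particular, the truncation is a placeholder heuristic and is not the optimal preprocessing identified here.

```python
import numpy as np


def sample_glm(n, d, sigma, beta, rng):
    """Draw n rows x_i ~ N(0, sigma) and outputs y_i = <x_i, beta>^2.

    The quadratic (phase-retrieval-like) link is an assumption made
    for illustration; any generalized linear model could be used.
    """
    chol = np.linalg.cholesky(sigma)          # sigma = chol @ chol.T
    X = rng.standard_normal((n, d)) @ chol.T  # each row has covariance sigma
    y = (X @ beta) ** 2
    return X, y


def preprocess(y, cap=5.0):
    """Placeholder preprocessing: truncate large outputs at `cap`.

    This is a simple heuristic, NOT the optimal preprocessing
    discussed in the text.
    """
    return np.minimum(y, cap)


def spectral_estimate(X, y, preprocess_fn=preprocess):
    """Return the unit-norm principal eigenvector of
    D = (1/n) * sum_i T(y_i) x_i x_i^T, where T = preprocess_fn."""
    n = X.shape[0]
    t = preprocess_fn(y)
    D = (X * t[:, None]).T @ X / n
    eigvals, eigvecs = np.linalg.eigh(D)      # eigenvalues in ascending order
    return eigvecs[:, -1]                     # eigenvector of the largest one


rng = np.random.default_rng(0)
n, d = 4000, 50

# Unit-norm ground-truth parameter.
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)

# Hypothetical anisotropic covariance: diagonal with a spread of variances.
sigma = np.diag(np.linspace(0.5, 1.5, d))

X, y = sample_glm(n, d, sigma, beta, rng)
beta_hat = spectral_estimate(X, y)

# Absolute overlap, since the sign of beta is not identifiable from y.
overlap = abs(beta_hat @ beta)
print(f"overlap with the true parameter: {overlap:.3f}")
```

The overlap $|\langle \hat\beta, \beta \rangle|$ is reported in absolute value because the quadratic link makes the sign of $\beta$ unidentifiable. Passing a different `preprocess_fn` or a different `sigma` gives a quick empirical sense of how strongly the preprocessing and the covariance structure affect the estimate.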