We consider the problem of parameter estimation in a high-dimensional generalized linear model. Spectral methods obtained via the principal eigenvector of a suitable data-dependent matrix provide a simple yet surprisingly effective solution. However, despite their wide use, a rigorous performance characterization, as well as a principled way to preprocess the data, are available only for unstructured (i.i.d.\ Gaussian and Haar orthogonal) designs. In contrast, real-world data matrices are highly structured and exhibit non-trivial correlations. To address the problem, we consider correlated Gaussian designs capturing the anisotropic nature of the features via a covariance matrix $\Sigma$. Our main result is a precise asymptotic characterization of the performance of spectral estimators. This allows us to identify the optimal preprocessing that minimizes the number of samples needed for parameter estimation. Surprisingly, such preprocessing is universal across a broad set of statistical models, which partly addresses a conjecture on optimal spectral estimators for rotationally invariant designs. Our principled approach vastly improves upon previous heuristic methods, including for designs common in computational imaging and genetics. The proposed methodology, based on approximate message passing, is broadly applicable and opens the way to the precise characterization of spiked matrices and of the corresponding spectral methods in a variety of settings.
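As a rough illustration of the kind of spectral estimator the abstract refers to, the sketch below builds a correlated Gaussian design, generates responses from one particular generalized linear model, and takes the principal eigenvector of a preprocessed data-dependent matrix. Everything specific here is an assumption made for illustration and is not taken from the paper: the diagonal covariance spectrum, the noiseless phase-retrieval link $y = |\langle a, \beta \rangle|$, the matrix form $D = \frac{1}{n} A^\top \mathrm{diag}(T(y)) A$, the trimming function used as preprocessing $T$, and the names `spectral_estimator` and `demo_overlap`. In particular, the trimming function is a common heuristic placeholder, not the optimal preprocessing identified by the authors.

```python
import numpy as np


def spectral_estimator(A, y, preprocess):
    """Return the unit-norm principal eigenvector of D = A^T diag(T(y)) A / n.

    A: (n, d) design matrix, y: (n,) responses, preprocess: the function T.
    """
    n, _ = A.shape
    t = preprocess(y)                      # T(y_i), one weight per sample
    D = (A.T * t) @ A / n                  # (d, d) symmetric matrix
    _, eigvecs = np.linalg.eigh(D)         # eigenvalues in ascending order
    return eigvecs[:, -1]                  # eigenvector of the largest one


def demo_overlap(n=4000, d=100, seed=0):
    """Overlap |<estimate, beta>| on synthetic data (illustrative assumptions)."""
    rng = np.random.default_rng(seed)

    # Assumed covariance Sigma: diagonal with a mildly non-flat spectrum.
    sigma_diag = np.linspace(0.8, 1.2, d)
    # Rows of A are i.i.d. N(0, Sigma).
    A = rng.standard_normal((n, d)) * np.sqrt(sigma_diag)

    # Unit-norm ground-truth parameter.
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)

    # Assumed GLM: noiseless phase retrieval.
    y = np.abs(A @ beta)

    # Placeholder preprocessing: trimmed squared response (a heuristic,
    # NOT the optimal preprocessing derived in the paper).
    def trimmed(v):
        return np.minimum(v ** 2, 5.0)

    beta_hat = spectral_estimator(A, y, trimmed)
    # An eigenvector is defined up to sign, so compare the absolute overlap.
    return float(abs(beta_hat @ beta))


if __name__ == "__main__":
    print(f"overlap with ground truth: {demo_overlap():.3f}")
```

The overlap lies between 0 (uninformative) and 1 (perfect alignment up to sign). With a non-identity covariance, the principal eigenvector is generally pulled toward directions favored by $\Sigma$, which is one reason the choice of preprocessing matters for correlated designs.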