In a mixed generalized linear model, the goal is to learn multiple signals from unlabeled observations: each sample comes from exactly one signal, but it is not known which one. We consider the prototypical problem of estimating two statistically independent signals in a mixed generalized linear model with Gaussian covariates. Spectral methods are a popular class of estimators which output the top two eigenvectors of a suitable data-dependent matrix. However, despite their wide applicability, their design is still obtained via heuristic considerations, and the number of samples $n$ needed to guarantee recovery is super-linear in the signal dimension $d$. In this paper, we develop exact asymptotics for spectral methods in the challenging proportional regime in which $n$ and $d$ grow large and their ratio converges to a finite constant. This allows us to optimize the design of the spectral method, and to combine it with a simple linear estimator, in order to minimize the estimation error. Our characterization exploits a mix of tools from random matrices, free probability and the theory of approximate message passing algorithms. Numerical simulations for mixed linear regression and phase retrieval demonstrate the advantage enabled by our analysis over existing designs of spectral methods.
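To make the notion of a spectral estimator concrete, the following is a minimal sketch for noisy mixed linear regression. It builds the matrix $\frac{1}{n}\sum_i T(y_i)\, x_i x_i^\top$ and returns its top two eigenvectors. The preprocessing $T(y) = y^2$, the function names, the noise level and the problem sizes are all assumptions chosen for illustration; this is a common heuristic choice, not the optimized design derived in the paper, and the sample size is deliberately taken much larger than $d$ so that the heuristic works.

```python
import numpy as np

def spectral_estimate(X, y, preprocess=lambda t: t**2):
    """Top two eigenvectors of D = (1/n) * sum_i T(y_i) x_i x_i^T.

    `preprocess` plays the role of T; the default y**2 is a heuristic
    assumption, not the optimized preprocessing from the paper.
    """
    n, _ = X.shape
    weights = preprocess(y)              # shape (n,)
    D = (X.T * weights) @ X / n          # shape (d, d), symmetric
    _, eigvecs = np.linalg.eigh(D)       # eigenvalues in ascending order
    return eigvecs[:, -2:]               # columns: top two eigenvectors

def subspace_overlap(U, beta):
    """Fraction of the norm of beta captured by the column span of U."""
    return np.linalg.norm(U.T @ beta) / np.linalg.norm(beta)

# Synthetic data: two independent unit-norm signals, Gaussian covariates.
rng = np.random.default_rng(0)
d, n = 40, 40000                         # illustrative sizes (assumption)
beta1 = rng.standard_normal(d)
beta1 /= np.linalg.norm(beta1)
beta2 = rng.standard_normal(d)
beta2 /= np.linalg.norm(beta2)

X = rng.standard_normal((n, d))
labels = rng.random(n) < 0.5             # hidden: which signal generated each sample
y = np.where(labels, X @ beta1, X @ beta2) + 0.1 * rng.standard_normal(n)

U = spectral_estimate(X, y)
overlap1 = subspace_overlap(U, beta1)
overlap2 = subspace_overlap(U, beta2)
```

Here `overlap1` and `overlap2` lie in $[0, 1]$ and measure how well the estimated two-dimensional subspace captures each signal. Note that the top two eigenvectors only identify the subspace spanned by the signals; separating the individual signals within it requires an additional step.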