In a mixed generalized linear model, the objective is to learn multiple signals from unlabeled observations: each sample comes from exactly one signal, but it is not known which one. We consider the prototypical problem of estimating two statistically independent signals in a mixed generalized linear model with Gaussian covariates. Spectral methods are a popular class of estimators which output the top two eigenvectors of a suitable data-dependent matrix. However, despite their wide applicability, their design still relies on heuristic considerations, and the number of samples $n$ needed to guarantee recovery is super-linear in the signal dimension $d$. In this paper, we develop exact asymptotics for spectral methods in the challenging proportional regime in which $n$ and $d$ grow large and their ratio converges to a finite constant. This allows us to optimize the design of the spectral method, and to combine it with a simple linear estimator, in order to minimize the estimation error. Our characterization exploits a mix of tools from random matrix theory, free probability and the theory of approximate message passing algorithms. Numerical simulations for mixed linear regression and phase retrieval demonstrate the advantage enabled by our analysis over existing designs of spectral methods.
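To make the estimator concrete, the following is a minimal sketch of a spectral method for noisy mixed linear regression with Gaussian covariates. It forms the matrix $\frac{1}{n}\sum_i T(y_i)\, x_i x_i^\top$ and returns its top two eigenvectors, alongside the simple linear estimator $\frac{1}{n}\sum_i y_i x_i$. The preprocessing $T(y) = y^2$ is a common heuristic choice assumed here for illustration; it is not the optimized design derived in the paper. The function names (`spectral_estimate`, `linear_estimate`, `demo`) and all parameter values are hypothetical.

```python
import numpy as np


def spectral_estimate(X, y, preprocess=lambda t: t ** 2):
    """Top two eigenvectors of D = (1/n) * sum_i T(y_i) x_i x_i^T.

    T(y) = y**2 is an assumed heuristic preprocessing, not the paper's
    optimized choice.
    """
    n = X.shape[0]
    weights = preprocess(y)
    D = (X * weights[:, None]).T @ X / n
    # eigh returns eigenvalues in ascending order; keep the two largest.
    _, eigvecs = np.linalg.eigh(D)
    return eigvecs[:, -2:][:, ::-1]


def linear_estimate(X, y):
    """Linear estimator (1/n) * sum_i y_i x_i."""
    return X.T @ y / len(y)


def demo(n=40000, d=40, p1=0.6, noise_std=0.1, seed=0):
    """Simulate mixed linear regression and measure recovery quality."""
    rng = np.random.default_rng(seed)

    # Two independent unit-norm signals.
    beta = rng.standard_normal((2, d))
    beta /= np.linalg.norm(beta, axis=1, keepdims=True)

    # Hidden labels: signal 0 with probability p1, signal 1 otherwise.
    labels = (rng.random(n) >= p1).astype(int)

    # Gaussian covariates; each response uses exactly one of the signals.
    X = rng.standard_normal((n, d))
    y = np.einsum("ij,ij->i", X, beta[labels])
    y += noise_std * rng.standard_normal(n)

    # Squared norm of each signal's projection onto the estimated subspace
    # (1.0 means the signal lies entirely in the span of the eigenvectors).
    V = spectral_estimate(X, y)
    overlaps = np.sum((beta @ V) ** 2, axis=1)

    # For this model the linear estimator concentrates around
    # p1 * beta_0 + (1 - p1) * beta_1; report the cosine similarity.
    target = p1 * beta[0] + (1 - p1) * beta[1]
    lin = linear_estimate(X, y)
    cosine = lin @ target / (np.linalg.norm(lin) * np.linalg.norm(target))
    return overlaps, cosine


if __name__ == "__main__":
    overlaps, cosine = demo()
    print("subspace overlaps:", overlaps)
    print("linear estimator cosine:", cosine)
```

The default `n / d` ratio in `demo` is deliberately large so that the heuristic preprocessing works comfortably. In the proportional regime studied here, where this ratio is a moderate constant, the choice of preprocessing matters much more.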