Random feature ridge regression is often analyzed in the high-dimensional regime under the homogeneous sampling model $x_i=Σ^{1/2}x_i'$, where the vectors $x_i'$ have i.i.d. entries and all samples share the same covariance matrix $Σ$. In this paper, we move beyond this setting and study non-identically distributed data through a variance-profile model in which the training and test covariates have row-dependent diagonal covariance matrices $Σ_i=\operatorname{diag}(γ_{i1}^2,\ldots,γ_{ip}^2)$ and $\widetilde{Σ}_i=\operatorname{diag}(\tilde{γ}_{i1}^2,\ldots,\tilde{γ}_{ip}^2)$, respectively. Our main contribution is the derivation of asymptotic equivalents for the training and test risks of ridge regression with random features when $n$, $p$, and $m$ grow proportionally. The first set of equivalents is obtained by combining the linear-plus-chaos approximation with traffic-probability arguments, whereas the second set is deterministic and follows from operator-valued free probability through an amalgamation-over-the-diagonal argument. Numerical experiments indicate that these equivalents are sharp. They also reveal how heterogeneous variance profiles, including mixture-type profiles inspired by MNIST, can modify generalization and produce double-descent behavior when the ridge parameter is small.
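To make the variance-profile model concrete, the following is a minimal simulation sketch rather than the paper's own setup. It draws covariates whose $i$-th row has covariance $\operatorname{diag}(γ_{i1}^2,\ldots,γ_{ip}^2)$, fits ridge regression on random features, and reports empirical training and test risks. Several ingredients are assumptions made only for illustration: $n$, $p$, and $m$ are read as the number of samples, the covariate dimension, and the number of random features; the activation is `tanh`; the random weights are Gaussian with a $1/\sqrt{p}$ scaling; the responses come from a hypothetical noisy linear teacher; and the two-component mixture profile is a toy stand-in for an MNIST-inspired one. The function names are hypothetical as well.

```python
import numpy as np


def variance_profile_sample(gamma, rng):
    """Draw rows x_i = Sigma_i^{1/2} x_i' with Sigma_i = diag(gamma[i, :] ** 2).

    gamma has shape (n, p); x_i' has i.i.d. standard Gaussian entries
    (Gaussianity is an assumption of this sketch).
    """
    return gamma * rng.standard_normal(gamma.shape)


def random_feature_ridge_risks(gamma, gamma_test, m, lam, rng, noise=0.1):
    """Empirical train/test mean squared errors of random feature ridge regression."""
    n, p = gamma.shape
    X = variance_profile_sample(gamma, rng)
    X_test = variance_profile_sample(gamma_test, rng)

    # Hypothetical linear teacher with additive Gaussian noise (assumption).
    beta = rng.standard_normal(p) / np.sqrt(p)
    y = X @ beta + noise * rng.standard_normal(n)
    y_test = X_test @ beta + noise * rng.standard_normal(X_test.shape[0])

    # Random features: Gaussian weights, tanh activation (both assumptions).
    W = rng.standard_normal((p, m))
    F = np.tanh(X @ W / np.sqrt(p))
    F_test = np.tanh(X_test @ W / np.sqrt(p))

    # Ridge estimator: minimizes (1/n) * ||F a - y||^2 + lam * ||a||^2.
    a = np.linalg.solve(F.T @ F / n + lam * np.eye(m), F.T @ y / n)

    train_risk = np.mean((F @ a - y) ** 2)
    test_risk = np.mean((F_test @ a - y_test) ** 2)
    return train_risk, test_risk


rng = np.random.default_rng(0)
n, p, m = 200, 100, 150

# Toy mixture-type profile: each row picks one of two column-wise variance patterns.
profiles = np.stack([np.linspace(0.5, 1.5, p), np.linspace(1.5, 0.5, p)])
gamma = profiles[rng.integers(0, 2, size=n)]
gamma_test = profiles[rng.integers(0, 2, size=n)]

train_risk, test_risk = random_feature_ridge_risks(gamma, gamma_test, m, 1e-2, rng)
print(f"train risk: {train_risk:.4f}, test risk: {test_risk:.4f}")
```

Sweeping `m` at fixed `n` and `p` with a small `lam` is one way to look for the double-descent shape mentioned above. Setting every row of `gamma` to the same vector recovers a homogeneous model with diagonal $Σ$.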