For overparameterized linear regression with isotropic Gaussian design and the minimum-$\ell_p$ interpolator, $p\in(1,2]$, we give a unified, high-probability characterization of how the family of parameter norms $ \\{ \lVert \widehat{w_p} \rVert_r \\}_{r \in [1,p]} $ scales with sample size $n$. We resolve this basic but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal *spike* and a *bulk* of null coordinates in $X^\top Y$. The analysis yields closed-form predictions for (i) a data-dependent transition $n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates the norms $\lVert \widehat{w_p} \rVert_r$ that plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of *all* $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat{w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $\ell_r$ norm is used.
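For reference, the estimator named above can be written out explicitly. This is a sketch under assumed notation: the design $X\in\mathbb{R}^{n\times d}$ with $d>n$ and the response $Y\in\mathbb{R}^n$ are labels introduced here for illustration, since the text itself only names $X^\top Y$ and $n$.

$$
\widehat{w_p} \;=\; \arg\min_{w \in \mathbb{R}^d} \lVert w \rVert_p \quad \text{subject to} \quad Xw = Y, \qquad p \in (1,2].
$$

Plugging a few values of $p$ into the stated threshold $r_\star = 2(p-1)$ gives a sense of where it sits relative to the family $r \in [1,p]$:

$$
p = 2 \;\Rightarrow\; r_\star = 2, \qquad p = 1.5 \;\Rightarrow\; r_\star = 1, \qquad p = 1.2 \;\Rightarrow\; r_\star = 0.4 .
$$

Since $2(p-1) \le p$ whenever $p \le 2$, and $2(p-1) \ge 1$ exactly when $p \ge 3/2$, the threshold lands inside $[1,p]$ only for $p \in [3/2, 2]$. For $p < 3/2$, every $r$ in the family lies above $r_\star$.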