Near Optimal Heteroscedastic Regression with Symbiotic Learning

We consider the classical problem of heteroscedastic linear regression, where we are given $n$ samples $(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ generated as $y_i = \langle \mathbf{w}^{*}, \mathbf{x}_i \rangle + \epsilon_i \cdot \langle \mathbf{f}^{*}, \mathbf{x}_i \rangle$, with $\mathbf{x}_i \sim N(0,\mathbf{I})$ and $\epsilon_i \sim N(0,1)$, and the task is to estimate $\mathbf{w}^{*}$. Beyond their classical applications in statistics, econometrics, and time series analysis, heteroscedastic models are particularly relevant in machine learning when data is collected from multiple sources of varying but a priori unknown quality, e.g., in large model training. We show that $\mathbf{w}^{*}$ can be estimated in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2 \cdot \left(\frac{1}{n} + \left(\frac{d}{n}\right)^2\right)\right)$, and we prove a matching lower bound (up to logarithmic factors). This substantially improves upon the best previously known upper bound of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2\cdot \frac{d}{n}\right)$. Our upper bound is based on a novel analysis of a simple, classical heuristic going back to at least Davidian and Carroll (1987), and it constitutes the first non-asymptotic convergence guarantee for this approach. As a byproduct, our analysis also provides improved rates of estimation for both linear regression and phase retrieval with multiplicative noise, which may be of independent interest. The lower bound relies on a careful application of LeCam's two-point method, adapted to work with heavy-tailed random variables for which the relevant mutual information quantities are infinite (precluding a direct application of LeCam's method), and could also be of broader interest.
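To make the data model concrete, the following is a minimal NumPy sketch, not the algorithm analyzed in this work. It simulates $y_i = \langle \mathbf{w}^{*}, \mathbf{x}_i \rangle + \epsilon_i \cdot \langle \mathbf{f}^{*}, \mathbf{x}_i \rangle$ and alternates between a weighted least squares fit for $\mathbf{w}^{*}$ and a residual-based fit for $\mathbf{f}^{*}$, in the spirit of the classical heuristic mentioned above. Several choices are assumptions made purely for illustration: the spectral estimator for $\mathbf{f}^{*}$ (which uses the Gaussian identity $\mathbb{E}[\langle \mathbf{f}, \mathbf{x} \rangle^2 \mathbf{x}\mathbf{x}^\top] = \|\mathbf{f}\|^2 \mathbf{I} + 2\mathbf{f}\mathbf{f}^\top$), the regularizer `delta` that keeps the weights bounded, the number of rounds, and all function names.

```python
import numpy as np

def simulate(n, d, w_star, f_star, rng):
    """Draw n samples from y = <w*, x> + eps * <f*, x> with Gaussian x and eps."""
    X = rng.standard_normal((n, d))
    eps = rng.standard_normal(n)
    y = X @ w_star + eps * (X @ f_star)
    return X, y

def weighted_least_squares(X, y, weights):
    """Solve min_w sum_i weights[i] * (y_i - <w, x_i>)^2 via the normal equations."""
    Xw = X * weights[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def spectral_noise_direction(X, residuals):
    """Illustrative estimate of f* (up to sign) from squared residuals.

    For Gaussian x, E[<f, x>^2 x x^T] = ||f||^2 I + 2 f f^T, so the top
    eigenvector of the residual-weighted covariance points along f, and
    E[residual^2] = ||f||^2 gives the scale.
    """
    n = X.shape[0]
    M = (X * (residuals ** 2)[:, None]).T @ X / n
    _, eigvecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    direction = eigvecs[:, -1]
    scale = np.sqrt(np.mean(residuals ** 2))
    return scale * direction

def alternating_estimate(X, y, rounds=2, delta=0.05):
    """Alternate between fitting w (weighted LS) and f (from residuals).

    delta is a hypothetical regularizer: it bounds the weights when the
    estimated noise level <f_hat, x_i> is close to zero.
    """
    w_hat = weighted_least_squares(X, y, np.ones(len(y)))  # ordinary LS start
    for _ in range(rounds):
        f_hat = spectral_noise_direction(X, y - X @ w_hat)
        var_hat = (X @ f_hat) ** 2 + delta * (f_hat @ f_hat)
        w_hat = weighted_least_squares(X, y, 1.0 / var_hat)
    return w_hat, f_hat

# Small demonstration on synthetic data (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
n, d = 20000, 10
w_star = rng.standard_normal(d)
f_star = np.zeros(d)
f_star[0] = 1.0

X, y = simulate(n, d, w_star, f_star, rng)
w_ols = weighted_least_squares(X, y, np.ones(n))
w_alt, f_alt = alternating_estimate(X, y)

err_ols = float(np.sum((w_ols - w_star) ** 2))
err_alt = float(np.sum((w_alt - w_star) ** 2))
print("squared error, ordinary least squares:", err_ols)
print("squared error, alternating sketch:    ", err_alt)
```

The comparison against ordinary least squares is only meant to show why reweighting by an estimated noise level can help. The rates stated in the abstract refer to the estimator analyzed in the paper, not to this sketch.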
