Near Optimal Heteroscedastic Regression with Symbiotic Learning

We consider the problem of heteroscedastic linear regression: given $n$ samples $(\mathbf{x}_i, y_i)$ from $y_i = \langle \mathbf{w}^{*}, \mathbf{x}_i \rangle + \epsilon_i \cdot \langle \mathbf{f}^{*}, \mathbf{x}_i \rangle$ with $\mathbf{x}_i \sim N(0,\mathbf{I})$ and $\epsilon_i \sim N(0,1)$, we aim to estimate $\mathbf{w}^{*}$. Beyond their classical applications in statistics, econometrics, and time series analysis, such models are particularly relevant in machine learning when data is collected from multiple sources of varying but a priori unknown quality. We show that $\mathbf{w}^{*}$ can be estimated in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2 \cdot \left(\frac{1}{n} + \left(\frac{d}{n}\right)^2\right)\right)$, and we prove a matching lower bound (up to log factors). This is a substantial improvement over the previous best known upper bound of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2\cdot \frac{d}{n}\right)$. Our algorithm is an alternating minimization procedure with two key subroutines: (1) an adaptation of the classical weighted least squares heuristic to estimate $\mathbf{w}^{*}$, for which we provide the first non-asymptotic guarantee; and (2) a nonconvex pseudogradient descent procedure for estimating $\mathbf{f}^{*}$, inspired by phase retrieval. As corollaries, we obtain fast non-asymptotic rates for two important problems, linear regression with multiplicative noise and phase retrieval with multiplicative noise, both of which are of independent interest. Beyond this, the proof of our lower bound involves a novel adaptation of Le Cam's method for handling infinite mutual information quantities (which prevent a direct application of standard techniques like Fano's method), and could be of broader interest for establishing lower bounds for other heteroscedastic or heavy-tailed statistical problems.
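To make the model concrete, the following is a minimal NumPy sketch that draws samples from $y_i = \langle \mathbf{w}^{*}, \mathbf{x}_i \rangle + \epsilon_i \cdot \langle \mathbf{f}^{*}, \mathbf{x}_i \rangle$ and compares ordinary least squares with a weighted least squares fit. It is not the algorithm described above: the weights here use oracle knowledge of $\mathbf{f}^{*}$, whereas the alternating procedure has to estimate $\mathbf{f}^{*}$ from data. The function names, the variance floor `floor`, and the default sizes are all illustrative assumptions.

```python
import numpy as np


def sample_heteroscedastic(n, d, w_star, f_star, rng):
    """Draw n samples from y = <w*, x> + eps * <f*, x>, x ~ N(0, I), eps ~ N(0, 1)."""
    X = rng.standard_normal((n, d))
    eps = rng.standard_normal(n)
    y = X @ w_star + eps * (X @ f_star)
    return X, y


def ols(X, y):
    """Ordinary least squares estimate of w*."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def oracle_wls(X, y, f_star, floor=1e-6):
    """Weighted least squares with weights 1 / <f*, x>^2.

    Assumption: f* is known (an oracle). The conditional noise variance is
    floored at `floor` (a hypothetical choice) to avoid dividing by zero.
    """
    scale = np.sqrt(np.maximum((X @ f_star) ** 2, floor))
    # Dividing each row by its noise standard deviation turns WLS into OLS.
    return np.linalg.lstsq(X / scale[:, None], y / scale, rcond=None)[0]


def mean_squared_errors(n=2000, d=20, trials=20, seed=0):
    """Average squared-norm error of OLS and oracle WLS over several trials."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    f_star = rng.standard_normal(d)
    ols_err, wls_err = 0.0, 0.0
    for _ in range(trials):
        X, y = sample_heteroscedastic(n, d, w_star, f_star, rng)
        ols_err += np.sum((ols(X, y) - w_star) ** 2) / trials
        wls_err += np.sum((oracle_wls(X, y, f_star) - w_star) ** 2) / trials
    return ols_err, wls_err


if __name__ == "__main__":
    ols_err, wls_err = mean_squared_errors()
    print(f"OLS error: {ols_err:.5f}   oracle WLS error: {wls_err:.5f}")
```

Under these assumptions the oracle-weighted fit should show a visibly smaller error than OLS, which illustrates why reweighting by the noise scale is the natural starting point; the hard part addressed by the abstract is obtaining such weights without knowing $\mathbf{f}^{*}$.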
