A powerful concept behind much of the recent progress in machine learning is the extraction of common features across data from heterogeneous sources or tasks. Intuitively, using all of one's data to learn a common representation function benefits both computational effort and statistical generalization by leaving a smaller number of parameters to fine-tune on a given task. Toward theoretically grounding these merits, we propose a general setting of recovering linear operators $M$ from noisy vector measurements $y = Mx + w$, where the covariates $x$ may be both non-i.i.d. and non-isotropic. We demonstrate that existing isotropy-agnostic representation learning approaches incur biases in the representation update, which cause the noise terms to lose their favorable scaling with the number of source tasks. This in turn can cause the sample complexity of representation learning to be bottlenecked by the single-task data size. We introduce an adaptation, $\texttt{De-bias \& Feature-Whiten}$ ($\texttt{DFW}$), of the popular alternating minimization-descent scheme proposed independently in Collins et al. (2021) and Nayer and Vaswani (2022), and establish linear convergence to the optimal representation with a noise level that scales down with the $\textit{total}$ source data size. This leads to generalization bounds on the same order as those of an oracle empirical risk minimizer. We verify the vital importance of $\texttt{DFW}$ in various numerical simulations. In particular, we show that vanilla alternating minimization-descent fails catastrophically even for i.i.d. but mildly non-isotropic data. Our analysis unifies and generalizes prior work, and provides a flexible framework for a wider range of applications, such as controls and dynamical systems.
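To make the measurement model and the alternating scheme concrete, the following is a minimal Python sketch of multi-task linear regression with a shared low-rank representation and a feature-whitened representation update. It is an illustrative toy under our own assumptions, not the paper's $\texttt{DFW}$ implementation: all names (`Phi`, `F_star`, `Sigma_hat`, the step size `eta`) are ours, the whitening step is only one plausible reading of "feature-whitening" as preconditioning by the inverse empirical covariate covariance, and the de-biasing component of $\texttt{DFW}$ is omitted entirely.

```python
import numpy as np

# Hypothetical toy sketch: multi-task model y = M_t x + w with M_t = F_t Phi,
# where Phi (r x d) is a shared low-rank representation and F_t are task weights.
rng = np.random.default_rng(0)
d, r, T, n = 20, 3, 10, 200            # ambient dim, rank, tasks, samples/task

Phi_star = np.linalg.qr(rng.standard_normal((d, r)))[0].T   # ground truth, r x d
F_star = rng.standard_normal((T, r))                        # per-task weights

# Non-isotropic covariates: a shared covariance different from the identity.
A = rng.standard_normal((d, d)) / np.sqrt(d)
Sigma_sqrt = np.eye(d) + 0.5 * A
X = [rng.standard_normal((n, d)) @ Sigma_sqrt.T for _ in range(T)]
Y = [X[t] @ (F_star[t] @ Phi_star) + 0.1 * rng.standard_normal(n)
     for t in range(T)]

Phi = np.linalg.qr(rng.standard_normal((d, r)))[0].T        # random init
eta = 0.5
for it in range(50):
    grad = np.zeros((r, d))
    for t in range(T):
        Z = X[t] @ Phi.T                                    # current features
        f_t = np.linalg.lstsq(Z, Y[t], rcond=None)[0]       # minimize over F_t
        resid = Z @ f_t - Y[t]
        # Whitening (our rendering): precondition the representation gradient
        # by the inverse empirical covariate covariance so that non-isotropic
        # x does not skew the averaged update across tasks.
        Sigma_hat = X[t].T @ X[t] / n
        grad += np.outer(f_t, resid @ X[t] / n) @ np.linalg.inv(Sigma_hat) / T
    Phi = Phi - eta * grad
    Phi = np.linalg.qr(Phi.T)[0].T                          # re-orthonormalize

# Subspace distance between learned and ground-truth representations.
err = np.linalg.norm(Phi.T @ Phi - Phi_star.T @ Phi_star)
print(f"subspace error after training: {err:.3f}")
```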