Feature learning, i.e., extracting meaningful representations of data, is essential to the practical success of neural networks trained with gradient descent, yet it is notoriously difficult to explain how and why it occurs. Recent theoretical studies have shown that shallow neural networks optimized on a single task with gradient-based methods can learn meaningful features, extending our understanding beyond the neural tangent kernel and random feature regimes, in which negligible feature learning occurs. In practice, however, neural networks are increasingly trained on {\em many} tasks simultaneously with differing loss functions, and these prior analyses do not generalize to such settings. In the multi-task learning setting, a variety of studies have shown effective feature learning by simple linear models. However, multi-task learning via {\em nonlinear} models, arguably the most common learning paradigm in practice, remains largely mysterious. In this work, we present the first results proving that feature learning occurs in a multi-task setting with a nonlinear model. We show that when the tasks are binary classification problems with labels depending on only $r$ directions within the ambient $d$-dimensional input space, where $d \gg r$, executing a simple gradient-based multi-task learning algorithm on a two-layer ReLU neural network learns the ground-truth $r$ directions. In particular, any downstream task on the $r$ ground-truth coordinates can be solved by learning a linear classifier with sample and neuron complexity independent of the ambient dimension $d$, whereas a random feature model requires complexity exponential in $d$ for such a guarantee.
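The setting described in the abstract can be illustrated with a small numerical sketch. This is not the algorithm analyzed in the paper, whose details are not given here. It is a minimal NumPy example under stated assumptions: the labels of each task are a random linear threshold on the first $r$ input coordinates, the network shares one ReLU layer across tasks with one linear head per task, and training is plain full-batch gradient descent on the logistic loss. The names `make_tasks`, `train`, and `subspace_mass` are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_tasks(n, d, r, T, rng):
    """Assumed data model: T binary tasks whose labels depend only on the first r of d coordinates."""
    X = rng.standard_normal((n, d))
    A = rng.standard_normal((r, T))   # one random direction per task (assumption)
    Y = np.sign(X[:, :r] @ A)         # labels in {-1, +1}
    Y[Y == 0] = 1.0                   # break ties deterministically
    return X, Y

def subspace_mass(W, r):
    """Fraction of the squared first-layer weight norm lying in the first r coordinates."""
    return float(np.sum(W[:, :r] ** 2) / np.sum(W ** 2))

def train(X, Y, m=64, steps=200, lr=0.1, seed=0):
    """Full-batch gradient descent on a two-layer ReLU network with a shared first layer."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    T = Y.shape[1]
    W = rng.standard_normal((m, d)) / np.sqrt(d)   # shared first-layer weights
    b = np.zeros(m)                                # shared first-layer biases
    V = rng.standard_normal((m, T)) / np.sqrt(m)   # one linear head per task
    for _ in range(steps):
        Z = X @ W.T + b                            # pre-activations, shape (n, m)
        H = np.maximum(Z, 0.0)                     # ReLU features
        F = H @ V                                  # task outputs, shape (n, T)
        # Gradient of the mean logistic loss log(1 + exp(-y f)) with respect to F.
        G = -Y / (1.0 + np.exp(Y * F)) / (n * T)
        gV = H.T @ G
        gZ = (G @ V.T) * (Z > 0)                   # backpropagate through the ReLU
        gW = gZ.T @ X
        gb = gZ.sum(axis=0)
        W -= lr * gW
        b -= lr * gb
        V -= lr * gV
    return W, b, V

rng = np.random.default_rng(0)
X, Y = make_tasks(n=512, d=50, r=3, T=20, rng=rng)
W, b, V = train(X, Y)
# Compare with r / d = 0.06, the expected mass of a random initialization.
print(subspace_mass(W, r=3))
```

The printed quantity is only a crude diagnostic: a value well above $r/d$ would suggest that the shared layer concentrates on the relevant coordinates, but this toy experiment does not reproduce or verify the paper's guarantees.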