We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ ($m=\mathrm{poly}(n,d)$), which results in a prohibitively large weight matrix $W\in \mathbb{R}^{m\times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network function in both the forward and backward computations. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that incurs $m^2$ cost only in the initialization phase and achieves \emph{a truly subquadratic cost per iteration} in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. Our result has implications beyond standard over-parametrization theory, as it can be viewed as designing an efficient data structure on top of a pre-trained large model to further speed up the fine-tuning process, a core procedure in deploying large language models (LLMs).