Large language models (LLMs) have made fundamental contributions over the last few years. To train an LLM, one needs to alternately run `forward' computations and `backward' computations. The forward computation can be viewed as attention function evaluation, and the backward computation can be viewed as a gradient computation. In previous work by [Alman and Song, NeurIPS 2023], it was proved that the forward step can be performed in almost-linear time in certain parameter regimes, but that there is no truly sub-quadratic time algorithm in the remaining parameter regimes unless the popular hypothesis SETH is false. In this work, we show nearly identical results for the harder-seeming problem of computing the gradient of the loss function of a one-layer attention network, and thus for the entire process of LLM training. This completely characterizes the fine-grained complexity of every step of LLM training.
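To make the two steps concrete, the following is a minimal NumPy sketch of the naive quadratic-time baseline: a forward pass that evaluates softmax attention, and a backward pass that computes the gradient of a squared loss. It is an illustration under stated assumptions, not the algorithm of this work. The function names, the target matrix `E`, the squared-error loss, and the `1/d` scaling of the scores are all assumptions chosen for the example; the exact parameterization analyzed in the paper may differ. Both passes explicitly form the n-by-n attention matrix, which is the quadratic cost that the almost-linear-time results avoid in the favorable parameter regimes.

```python
import numpy as np


def attention_forward(Q, K, V):
    """Naive softmax attention D^{-1} exp(Q K^T / d) V, in O(n^2 d) time.

    The 1/d scaling is an assumption for this sketch; 1/sqrt(d) is also common.
    """
    n, d = Q.shape
    S = Q @ K.T / d                        # n x n score matrix (quadratic cost)
    S = S - S.max(axis=1, keepdims=True)   # shift rows for numerical stability
    A = np.exp(S)
    P = A / A.sum(axis=1, keepdims=True)   # row-wise softmax
    return P @ V, P


def loss_and_grads(Q, K, V, E):
    """Squared loss 0.5 * ||Att(Q, K, V) - E||_F^2 and its gradients.

    E is a hypothetical target matrix with the same shape as V.
    """
    n, d = Q.shape
    out, P = attention_forward(Q, K, V)
    R = out - E                            # residual
    loss = 0.5 * np.sum(R ** 2)

    dV = P.T @ R                           # gradient with respect to V
    dP = R @ V.T                           # gradient with respect to softmax output
    # Backward through the row-wise softmax.
    dS = P * (dP - np.sum(dP * P, axis=1, keepdims=True))
    dQ = dS @ K / d                        # gradient with respect to Q
    dK = dS.T @ Q / d                      # gradient with respect to K
    return loss, dQ, dK, dV
```

A central finite-difference check on a few entries of `Q`, `K`, or `V` is a simple way to confirm that the analytic gradients above match the loss.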