As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite its success, token-level training incurs considerable computational cost because every token in the corpus must be processed. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M--2.7B parameters) demonstrate that patch-level training can reduce the overall computational cost to 0.5$\times$ without compromising model performance relative to token-level training. Source code: \url{https://github.com/shaochenze/PatchTrain}.
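To make the idea concrete, the sketch below shows one way the token-to-patch compression could be realized; it is a minimal illustration, assuming patches are formed by mean-pooling the embeddings of $K$ consecutive tokens (the abstract does not fix the compression scheme, and the patch size $K=4$ here is purely illustrative).

```python
# Minimal sketch of patch-level sequence compression (illustrative assumption:
# a patch embedding is the mean of the embeddings of K consecutive tokens).
import numpy as np

def tokens_to_patches(token_embeddings: np.ndarray, patch_size: int = 4) -> np.ndarray:
    """Compress a (seq_len, dim) token-embedding sequence into
    (seq_len // patch_size, dim) patch embeddings by averaging."""
    seq_len, dim = token_embeddings.shape
    usable = (seq_len // patch_size) * patch_size      # drop any ragged tail
    patches = token_embeddings[:usable].reshape(-1, patch_size, dim)
    return patches.mean(axis=1)                        # one vector per patch

# Example: a 2048-token sequence becomes a 512-patch sequence, so the
# Transformer processes 4x fewer positions during patch-level training.
embeddings = np.random.randn(2048, 768).astype(np.float32)
patches = tokens_to_patches(embeddings, patch_size=4)
print(patches.shape)  # (512, 768)
```

As a rough accounting, if a fraction $\lambda$ of the training data is processed at patch level with patch size $K$ and the rest at token level, the overall cost is about $\lambda/K + (1-\lambda)$ of standard token-level training; for instance, $K=4$ and $\lambda=2/3$ would yield the 0.5$\times$ figure quoted above (these particular values are an illustrative assumption, not taken from the abstract).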