We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both the forward and backward passes. In experiments with the Llama3-8B model, we measure no degradation in throughput or convergence with MsT, even with sequences 12x longer than standard implementations support. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks. Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
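The key observation behind partitioning can be illustrated with a toy example: blocks that act on each token independently (such as the MLP up/down projections or the LM head) produce identical results whether the full sequence or one mini-sequence at a time is processed, so the large intermediate activation only ever needs to exist for a fraction of the tokens. The NumPy sketch below is illustrative only (function and variable names are ours, not the paper's implementation):

```python
import numpy as np

def mlp(x, w_up, w_down):
    # The up-projection creates a large (seq_len, 4*d) intermediate;
    # for long sequences this dominates activation memory.
    h = np.maximum(x @ w_up, 0.0)  # ReLU stands in for the MLP nonlinearity
    return h @ w_down

def mlp_mini_seq(x, w_up, w_down, num_chunks=4):
    # Process mini-sequences one at a time: the large intermediate
    # exists only for seq_len / num_chunks tokens at any moment.
    outs = [mlp(chunk, w_up, w_down)
            for chunk in np.array_split(x, num_chunks, axis=0)]
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((12, d))
w_up = rng.standard_normal((d, 4 * d))
w_down = rng.standard_normal((4 * d, d))

# Chunked processing is numerically identical to the full pass.
assert np.allclose(mlp(x, w_up, w_down), mlp_mini_seq(x, w_up, w_down))
```

Because the per-token computation is unchanged, throughput and convergence are unaffected; only the peak size of intermediate activations shrinks.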