Large language models (LLMs) demonstrate enormous utility in long-context tasks, which require processing prompts of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy-to-use abstractions for optimizing long-context training; they instead focus on optimizations for models with large parameter counts, such as ZeRO-3/FSDP, tensor parallelism, and pipeline parallelism. This forces users to rewrite LLM training libraries to incorporate compositions of complex long-context optimizations, such as sequence parallelism, into their training pipelines, a process that requires in-depth expertise and reduces developer productivity. To tackle these challenges, we introduce AutoSP: the first automated solution for optimizing LLM training for longer contexts. AutoSP compiles models and applies a targeted set of optimizations, namely automated sequence parallelism and long-context-aware activation checkpointing, to drastically enhance LLM trainability at negligible cost to throughput. Our evaluation demonstrates AutoSP's capability on both NVIDIA and AMD hardware, increasing trainable context lengths by up to 2.7$\times$ and 2.5$\times$, respectively, over competitive hand-written baselines at negligible cost to runtime performance.