Large language models (LLMs) with long sequence lengths are beginning to power fundamentally new applications we use every day. Existing methods for long-sequence LLM training are neither efficient nor compatible with commonly used attention optimizations such as FlashAttention. We design InternEvo to address these issues. InternEvo decouples all of the sharding dimensions into a new hierarchical space and systematically analyzes the memory and communication costs of LLM training; it then generates an effective hybrid parallelism strategy. We design a new selective overlap mechanism to mitigate the communication overhead introduced by hybrid parallelism, and we implement memory management techniques to reduce GPU memory fragmentation. Evaluation results show that InternEvo generates parallelization strategies that match or outperform existing methods in model FLOPs utilization.
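To make the search the abstract describes concrete, the following Python is a minimal, hypothetical sketch of a planner that enumerates decoupled sharding dimensions, estimates per-GPU memory and communication cost, and picks a feasible hybrid parallelism configuration. The dimension names (dp, tp, pp, sp), the cost formulas, and all weights are illustrative assumptions, not InternEvo's actual models.

    from dataclasses import dataclass
    from itertools import product

    # Hypothetical sketch, not InternEvo's implementation: search over
    # decoupled sharding dimensions for the cheapest feasible strategy.

    @dataclass(frozen=True)
    class Config:
        dp: int  # data-parallel degree
        tp: int  # tensor-parallel degree
        pp: int  # pipeline-parallel degree
        sp: int  # sequence-parallel degree

    def memory_per_gpu_gb(cfg: Config, params_b: float, seq_len: int) -> float:
        """Toy memory model. 16 bytes/param is a common rule of thumb for
        fp16 weights + gradients + fp32 Adam state; states are assumed to
        be sharded across tp*pp*dp, activations across tp*sp."""
        param_state = params_b * 16 / (cfg.tp * cfg.pp * cfg.dp)
        activations = 0.5 * params_b * (seq_len / 4096) / (cfg.tp * cfg.sp)
        return param_state + activations

    def comm_cost(cfg: Config) -> float:
        """Toy communication model: higher parallel degrees trigger more
        collectives. The weights are made-up illustrative constants."""
        return 1.0 * cfg.tp + 0.5 * cfg.sp + 0.2 * cfg.pp + 0.1 * cfg.dp

    def search(n_gpus: int, params_b: float, seq_len: int,
               mem_budget_gb: float) -> Config | None:
        best, best_cost = None, float("inf")
        for dp, tp, pp, sp in product([1, 2, 4, 8], repeat=4):
            cfg = Config(dp, tp, pp, sp)
            if dp * tp * pp * sp != n_gpus:
                continue  # must use exactly the available GPUs
            if memory_per_gpu_gb(cfg, params_b, seq_len) > mem_budget_gb:
                continue  # violates the per-GPU memory budget
            cost = comm_cost(cfg)
            if cost < best_cost:
                best, best_cost = cfg, cost
        return best

    if __name__ == "__main__":
        # e.g. a 7B-parameter model, 32K sequence, 64 GPUs of 80 GB each
        print(search(n_gpus=64, params_b=7.0, seq_len=32768,
                     mem_budget_gb=80.0))

The toy only shows the shape of the search loop; per the abstract, InternEvo's analysis of memory and communication cost is systematic rather than based on fixed weights like these.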