ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

Optimizing large language model (LLM) training on distributed domain-specific accelerator systems is challenging because the optimization space is complex. Existing methods rely on time-consuming manual tuning or resource-intensive black-box search, and they struggle to keep pace with the rapidly evolving LLM domain, which leads to slow development and underutilized resources. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. ASAP is a multi-agent system comprising Coordinator, Analyzer, and Proposal agents. It integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. The design automates the diagnosis of performance bottlenecks and recommends optimized sharding configurations together with the reasoning behind them, improving the efficiency of distributed LLM training. In experiments, ASAP-generated sharding configurations reduce training step time by up to 28% and improve throughput by up to 1.43 times. Combined with additional optimizations from human experts, throughput increases further to 2.58 times. ASAP promises a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training.
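The abstract names roofline analysis as one input to the agents but does not say how ASAP applies it. As a rough illustration only, the standard roofline model bounds an operation's attainable performance by the lower of the compute roof and the memory-bandwidth roof scaled by arithmetic intensity. The sketch below assumes a hypothetical `Accelerator` type with made-up peak numbers; none of the names or values come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical hardware limits; not taken from the paper."""
    peak_flops: float     # peak compute, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
        return self.peak_flops / self.mem_bandwidth


def attainable_flops(hw: Accelerator, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth roof * intensity)."""
    return min(hw.peak_flops, hw.mem_bandwidth * intensity)


def classify(hw: Accelerator, flops: float, bytes_moved: float) -> str:
    """Label an operation by comparing its intensity to the ridge point."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= hw.ridge_point else "memory-bound"


# Made-up accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
hw = Accelerator(peak_flops=100e12, mem_bandwidth=1e12)

# Square matmul with n = 4096 in 2-byte precision:
# 2*n^3 FLOPs, three n*n matrices read or written.
n = 4096
matmul_kind = classify(hw, flops=2 * n**3, bytes_moved=3 * n * n * 2)

# Elementwise add over n*n values: one FLOP per element,
# two reads and one write of 2 bytes each.
add_kind = classify(hw, flops=n * n, bytes_moved=3 * n * n * 2)

print(matmul_kind, add_kind)
```

A diagnosis like this is only a first-order signal: in distributed training, communication between devices adds further limits that a single-device roofline does not capture, which is presumably why the abstract pairs it with profiling data and expert knowledge.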

