Large Language Models (LLMs) have advanced rapidly in both academia and industry, and their popularity has produced numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs is expensive because it requires considerable computing resources and memory, so many efficient approaches have been developed to improve both system pipelines and operators. However, runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we benchmark performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes, i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B), on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention. Then, we provide a detailed runtime analysis of the sub-modules, including the computing and communication operators in LLMs. For end users, our benchmark and findings help clarify how optimization techniques, training and inference frameworks, and hardware platforms affect the choice of configuration for deploying LLMs. For researchers, our in-depth module-wise analyses reveal potential opportunities for future work to further optimize the runtime performance of LLMs.