Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors

Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements. The self-attention mechanism, whose cost grows quadratically with sequence length, further exacerbates these challenges for long sequences and large datasets. Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising way to address these issues. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of GAUDI processors to accelerate Transformer-based models and addresses key challenges in the process. First, we provide a comprehensive performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Second, we explore strategies to optimize MME and TPC utilization, offering practical insights to improve computational efficiency. Third, we evaluate the performance of Transformers on GAUDI, particularly on long sequences, and identify performance bottlenecks. Lastly, we evaluate the end-to-end performance of two Transformer-based large language models (LLMs) on GAUDI. Through systematic profiling, analysis, and optimization exploration, this work offers practical insights for practitioners and researchers alike, bridges a research gap, and provides a roadmap for optimizing Transformer-based model training on the GAUDI architecture.
