Value-level parallelism (VLP) has been proposed to improve the efficiency of large-batch, low-precision general matrix multiply (GEMM) between symmetric activations and weights. Transformer-based large language models (LLMs), however, involve operations more sophisticated than activation-weight GEMM. In this paper, we explore how VLP can benefit LLMs. First, we generalize VLP to nonlinear approximations, outperforming existing nonlinear approximation methods in end-to-end LLM accuracy, performance, and efficiency. Our VLP approximation follows a value-centric approach, in which important values are assigned greater precision. Second, we optimize VLP for small-batch GEMMs with asymmetric inputs, leveraging recent LLM optimizations including weight-only quantization, key-value (KV) cache quantization, and grouped-query attention. Finally, we design a new VLP architecture, Mugi, that encapsulates these innovations and supports full LLM workloads while providing better performance, efficiency, and sustainability. Our experimental results show that Mugi delivers significant improvements in throughput and energy efficiency: up to $45\times$ and $668\times$ for nonlinear softmax operations, and $2.07\times$ and $3.11\times$ for end-to-end LLMs, respectively, while reducing the operational carbon of LLM serving by $1.45\times$ and embodied carbon by $1.48\times$.