Value-level parallelism (VLP) has been proposed to improve the efficiency of large-batch, low-precision general matrix multiplication (GEMM) between symmetric activations and weights. Transformer-based large language models (LLMs), however, involve more sophisticated operations than activation-weight GEMM. In this paper, we explore how VLP can benefit LLMs. First, we generalize VLP to nonlinear approximations, outperforming existing nonlinear approximation methods in end-to-end LLM accuracy, performance, and efficiency. Our VLP approximation follows a value-centric approach, in which important values are assigned greater precision. Second, we optimize VLP for small-batch GEMMs with asymmetric inputs, leveraging recent LLM optimizations including weight-only quantization, key-value (KV) cache quantization, and grouped-query attention. Finally, we design a new VLP architecture, Mugi, that encapsulates these innovations and supports full LLM workloads while delivering better performance, efficiency, and sustainability. Our experimental results show that Mugi significantly improves throughput and energy efficiency: by up to $45\times$ and $668\times$ for nonlinear softmax operations and by $2.07\times$ and $3.11\times$ for end-to-end LLMs, while reducing operational carbon by $1.45\times$ and embodied carbon by $1.48\times$.
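The value-centric idea above (assigning greater precision to important values) can be illustrated with a minimal sketch. This is NOT the paper's algorithm: it is a hypothetical NumPy example, assuming "importance" means the largest logits in a softmax and that unimportant logits are coarsely quantized before exponentiation. The function name `value_centric_softmax` and the parameters `keep_frac` and `low_bits` are illustrative inventions.

```python
import numpy as np

def value_centric_softmax(logits, keep_frac=0.25, low_bits=4):
    """Illustrative value-centric softmax approximation.

    The largest logits (which dominate the softmax mass) keep full
    precision; the remaining logits are quantized to 2**low_bits levels.
    """
    logits = np.asarray(logits, dtype=np.float64)
    k = max(1, int(len(logits) * keep_frac))
    # Indices of the k largest logits -- these are the "important" values.
    top = np.argpartition(logits, -k)[-k:]
    approx = logits.copy()
    mask = np.ones(len(logits), dtype=bool)
    mask[top] = False
    # Coarsely quantize the remaining (unimportant) logits.
    if mask.any():
        lo, hi = approx[mask].min(), approx[mask].max()
        if hi > lo:
            levels = 2 ** low_bits - 1
            q = np.round((approx[mask] - lo) / (hi - lo) * levels)
            approx[mask] = lo + q / levels * (hi - lo)
    # Numerically stable softmax on the mixed-precision logits.
    shifted = approx - approx.max()
    e = np.exp(shifted)
    return e / e.sum()
```

Because the top logits are untouched, the large probabilities are computed exactly, and quantization error is confined to entries that contribute little to the distribution.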