Large language models (LLMs) demonstrate outstanding performance across a wide range of machine learning tasks and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference is challenging: the enormous model size leads to high compute and memory requirements, and the models are difficult to run on integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of the decomposed matrices differ by powers of two. This scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, and it requires only a minimal extension to commodity tensor compute hardware. Our evaluation shows that Tender achieves higher accuracy and inference performance than state-of-the-art methods while also being significantly less intrusive to existing accelerators.
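To make the requantization-free accumulation concrete, the following is a minimal pure-Python sketch of the general idea, not Tender's actual algorithm or hardware datapath. It assumes symmetric 8-bit quantization, a simple magnitude-based rule for assigning activation channels to groups, and group scales of the form `base * 2**g`; all function names (`quantize`, `channel_groups`, `partial_sums`, `shift_accumulate`) and the toy data are hypothetical. Because adjacent group scales differ by a factor of two, the integer partial sums can be combined with 1-bit shifts and dequantized only once at the end.

```python
def quantize(value, scale, bits=8):
    """Round value / scale to a signed integer, clamped to the symmetric range."""
    qmax = (1 << (bits - 1)) - 1
    return max(-qmax, min(qmax, round(value / scale)))


def channel_groups(rows, num_groups, bits=8):
    """Assign each channel (column) a group g; group g uses scale base * 2**g.

    Assumption: channels are grouped by their absolute maximum over all rows.
    """
    qmax = (1 << (bits - 1)) - 1
    absmax = [max(abs(r[c]) for r in rows) for c in range(len(rows[0]))]
    top = max(absmax) or 1.0                     # guard against an all-zero input
    base = top / qmax / 2 ** (num_groups - 1)    # scale of the smallest group
    groups = []
    for a in absmax:
        g = 0
        # Move to the next group while the channel does not fit the current range.
        while g < num_groups - 1 and a > qmax * base * 2 ** g:
            g += 1
        groups.append(g)
    return base, groups


def partial_sums(xq, wq, groups, num_groups):
    """Integer dot product of one row, kept separate per channel group."""
    partial = [0] * num_groups
    for xv, wv, g in zip(xq, wq, groups):
        partial[g] += xv * wv
    return partial


def shift_accumulate(partial):
    """Combine per-group sums with 1-bit shifts, largest scale first.

    Equivalent to sum(partial[g] * 2**g), evaluated in Horner form so that
    no floating-point dequantization is needed between groups.
    """
    acc = 0
    for p in reversed(partial):
        acc = (acc << 1) + p
    return acc


# Toy activations (hypothetical): channel 2 carries outlier values.
X = [[0.10, -0.25, 12.0, 0.40],
     [0.05, 0.30, -9.5, -0.20]]
w = [0.5, -1.0, 0.25, 0.75]

NUM_GROUPS = 4
base, groups = channel_groups(X, NUM_GROUPS)
w_scale = max(abs(v) for v in w) / 127
wq = [quantize(v, w_scale) for v in w]

results = []
for row in X:
    xq = [quantize(v, base * 2 ** g) for v, g in zip(row, groups)]
    acc = shift_accumulate(partial_sums(xq, wq, groups, NUM_GROUPS))
    results.append(acc * base * w_scale)   # single dequantization at the end
```

In this sketch the outlier channel lands in the group with the largest scale, while the remaining channels keep a fine-grained scale, and the per-row results stay close to the floating-point dot products (about 3.6 and -2.8 for the toy data).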