This paper investigates quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider a Generic MatMul setting, in which both matrices must be quantized (weight+activation quantization) without any a priori (calibration) statistical information about the factors. We review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory) and contrast it with the performance of popular quantization schemes, namely absmax INT and floating-point (FP), for which we also derive accurate heuristic approximations. Part II of this paper studies the weight-only quantization setup, in which second-order statistics of the activation matrices are available at the encoder.
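To make the weight+activation setting concrete, the following is a minimal sketch of absmax INT quantization applied to both factors of a product, with no calibration data. It is an illustration under stated assumptions, not the scheme analyzed in the paper: the per-row and per-column scaling granularity, the round-to-nearest rule, and all function names (`absmax_quantize`, `quantized_matmul`, `relative_error`, `random_matrix`) are hypothetical choices made here.

```python
import random


def absmax_quantize(vec, bits=8):
    """Symmetric absmax quantization of one vector to signed `bits`-bit integers.

    Returns (integer codes, scale) such that code * scale approximates the input.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for INT8, 7 for INT4
    amax = max(abs(x) for x in vec)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def matmul(A, B):
    """Reference floating-point product of two list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def quantized_matmul(A, B, bits=8):
    """Approximate A @ B after quantizing rows of A and columns of B (assumed granularity)."""
    qa = [absmax_quantize(row, bits) for row in A]
    qb = [absmax_quantize(col, bits) for col in zip(*B)]
    # Integer dot product, then rescale by the two absmax scales.
    return [[sa * sb * sum(x * y for x, y in zip(ra, cb)) for cb, sb in qb]
            for ra, sa in qa]


def relative_error(A, B, bits):
    """Squared Frobenius error of the quantized product, normalized by the exact product."""
    exact = matmul(A, B)
    approx = quantized_matmul(A, B, bits)
    num = sum((e - a) ** 2 for re, ra in zip(exact, approx) for e, a in zip(re, ra))
    den = sum(e ** 2 for re in exact for e in re)
    return num / den


def random_matrix(rows, cols, rng):
    """Matrix with i.i.d. standard Gaussian entries (a stand-in for real data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
```

Calling `relative_error(A, B, bits)` on Gaussian test matrices for several values of `bits` gives an empirical rate-distortion curve for this particular scheme, which is the kind of quantity the abstract contrasts with the information-theoretic tradeoff.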