Matrix-vector multiplication is a fundamental building block in neural networks, vector databases, and large language models, particularly during inference. As a result, more efficient matrix-vector multiplication engines translate directly into more efficient inference. Recent work has explored low-bit quantization of model weights, where matrices are represented using binary (1-bit) or ternary (1.58-bit) values while activations are kept in higher precision. These representations enable efficient hardware-level computation. In parallel, algorithms such as Redundant Segment Reduction (RSR) provide theoretical guarantees for accelerating low-bit matrix-vector multiplication. However, existing implementations operate at the application level and cannot be efficiently integrated into hardware kernels, which limits practical performance. To bridge this gap, we present RSR-core, a high-performance engine that implements the RSR algorithm as optimized low-level kernels for both CPU and CUDA environments. RSR-core supports efficient multiplication of binary and ternary weight matrices with general vectors, enabling practical deployment of the RSR algorithm in real inference pipelines. RSR-core is provided as a production-ready engine with HuggingFace integration for preprocessing low-bit models and running accelerated inference. Experimental results demonstrate significant performance improvements over baseline HuggingFace PyTorch multiplication, achieving up to 62x speedup on CPU and up to 1.9x speedup for token generation on CUDA for popular ternary LLMs. The source code is publicly available at https://github.com/UIC-InDeXLab/RSR-core.
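The abstract does not spell out how low-bit structure is exploited, so the following is only a simplified NumPy sketch of one way redundancy in binary and ternary matrices can be used, under stated assumptions. It assumes the product is `v @ W` for a vector `v` and a weight matrix `W` with entries in {-1, 0, 1}. It splits `W` into a difference of two binary matrices, cuts each into column blocks of width `k`, sums the entries of `v` over rows that share the same `k`-bit pattern, and then combines at most `2^k` partial sums per block. All function names (`preprocess_binary`, `pattern_matrix`, `multiply`, `ternary_multiply`) are hypothetical and are not the RSR-core API. The actual RSR-core kernels may differ substantially.

```python
import numpy as np

def preprocess_binary(B, k):
    """One-time step: for each column block of width <= k, encode every
    row's bit pattern as an integer code (bit j has weight 2**j)."""
    n, m = B.shape
    blocks = []
    for start in range(0, m, k):
        block = B[:, start:start + k].astype(np.int64)
        width = block.shape[1]            # last block may be narrower than k
        codes = block @ (2 ** np.arange(width))
        blocks.append((start, width, codes))
    return blocks

def pattern_matrix(width):
    """Row c holds the bits of integer c, shape (2**width, width)."""
    codes = np.arange(2 ** width)
    return (codes[:, None] >> np.arange(width)) & 1

def multiply(v, blocks, m):
    """Compute v @ B from the preprocessed blocks of a binary matrix B."""
    out = np.zeros(m, dtype=np.float64)
    for start, width, codes in blocks:
        # Sum v over all rows sharing the same bit pattern in this block.
        sums = np.bincount(codes, weights=v, minlength=2 ** width)
        # Expand the per-pattern sums back to the block's output columns.
        out[start:start + width] = sums @ pattern_matrix(width)
    return out

def ternary_multiply(v, W, k):
    """Ternary case (assumption: W = P - N with P, N binary)."""
    m = W.shape[1]
    pos = (W == 1).astype(np.int64)
    neg = (W == -1).astype(np.int64)
    return (multiply(v, preprocess_binary(pos, k), m)
            - multiply(v, preprocess_binary(neg, k), m))

# Usage: compare against a dense product on random data.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(64, 20))    # entries in {-1, 0, 1}
v = rng.standard_normal(64)
print(np.allclose(ternary_multiply(v, W, 3), v @ W))
```

In this sketch the preprocessing depends only on the weights, so it could be done once per model and reused for every input vector. The per-vector work in each block is one pass over `v` plus a small `2^k`-by-`k` product.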