Kolmogorov-Arnold Networks (KANs) have gained attention for their potential to outperform Multi-Layer Perceptrons (MLPs) in parameter efficiency and interpretability. Unlike traditional MLPs, KANs use learnable non-linear activation functions, typically spline functions expressed as linear combinations of basis splines (B-splines), whose coefficients serve as the model's learnable parameters. Evaluating these spline functions, however, increases computational complexity during inference. Conventional quantization reduces this complexity by lowering the numerical precision of parameters and activations, but its effect on KANs, and in particular its effectiveness at reducing their computational complexity, remains largely unexplored, especially at quantization levels below 8 bits. This study investigates the effect of low-bit quantization on KAN accuracy, computational complexity, and hardware efficiency. Results show that B-splines can be quantized to 2-3 bits with negligible loss in accuracy, significantly reducing computational complexity. Building on this, we investigate low-bit quantized precomputed tables as a replacement for the recursive B-spline evaluation algorithm, aiming to further reduce the computational complexity of KANs and enhance hardware efficiency while maintaining accuracy. For example, ResKAN18 achieves a 50x reduction in BitOps without loss of accuracy using low-bit-quantized B-spline tables. Precomputed 8-bit lookup tables improve GPU inference speed by up to 2.9x; on FPGA-based systolic-array accelerators, reducing B-spline table precision from 8 to 3 bits cuts resource usage by 36%, increases clock frequency by 50%, and yields a 1.24x speedup; and on a 28nm FD-SOI ASIC, reducing the B-spline bit-width from 16 to 3 bits achieves a 72% area reduction and 50% higher maximum frequency.