HiKonv: Maximizing the Throughput of Quantized Convolution With Novel Bit-wise Management and Computation

Quantization for convolutional neural networks (CNNs) has made significant progress in reducing the cost of computation and storage through low-bitwidth data representations. There are, however, no systematic studies on how an existing full-bitwidth processing unit, such as an ALU in a CPU or a DSP in an FPGA, can be better utilized to deliver significantly higher computation throughput for convolution under various quantized bitwidths. In this study, we propose HiKonv, a unified solution that maximizes the throughput of convolution on a given underlying processing unit with low-bitwidth quantized data inputs through novel bit-wise management and parallel computation. We establish a theoretical framework and performance models using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution, and demonstrate new breakthroughs for high-performance computing in this critical domain. For example, a single 32-bit processing unit in a CPU can deliver 128 binarized convolution operations (multiplications and additions) or 13 4-bit convolution operations with a single multiplication instruction, and a single 27x18 multiplier in an FPGA DSP can deliver 60, 8, or 2 convolution operations with 1-, 4-, or 8-bit inputs, respectively, in one clock cycle. We demonstrate the effectiveness of HiKonv on both CPU and FPGA. On CPU, HiKonv outperforms the baseline implementation with 1- to 8-bit inputs and provides up to 7.6x and 1.4x performance improvements for 1-D convolution, and delivers 2.74x and 3.19x the performance of the baseline implementation for 4-bit signed and unsigned data inputs, respectively, for 2-D convolution. On FPGA, the HiKonv solution enables a single DSP to process multiple convolutions with a shorter processing latency. For binarized input, each DSP with HiKonv is equivalent to up to 76.6 LUTs. Compared to the DAC-SDC 2020 champion model, HiKonv achieves a 2.37x throughput improvement and a 2.61x DSP efficiency improvement.
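
The core idea of computing several low-bitwidth convolution outputs with one wide multiplication can be illustrated with a minimal sketch. It is not the paper's exact formulation: the function names (`pack`, `conv1d_packed`, `conv1d_reference`) are hypothetical, only unsigned inputs are handled, the guard-bit rule is a standard sufficient condition chosen for this illustration, and Python's arbitrary-precision integers stand in for a hardware multiplier, so no 32-bit or 27x18 width limit is enforced here. Each input value is placed in its own fixed-width slot of a wide integer; multiplying the two packed integers is then a polynomial multiplication, and each slot of the product holds one 1-D convolution output.

```python
def pack(values, stride):
    """Place each unsigned value in its own `stride`-bit slot of one integer."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (stride * i)
    return word


def conv1d_packed(f, g, bits_f, bits_g):
    """Full 1-D convolution of unsigned sequences using ONE multiplication.

    Assumption for this sketch: every value in f fits in bits_f bits and
    every value in g fits in bits_g bits.
    """
    # Each output sums at most min(len(f), len(g)) products, so reserve
    # enough guard bits that a slot can never overflow into its neighbour.
    guard = (min(len(f), len(g)) - 1).bit_length()
    stride = bits_f + bits_g + guard

    # The single wide multiplication: slot m of the product accumulates
    # sum over n of f[n] * g[m - n].
    product = pack(f, stride) * pack(g, stride)

    mask = (1 << stride) - 1
    n_out = len(f) + len(g) - 1
    return [(product >> (stride * m)) & mask for m in range(n_out)]


def conv1d_reference(f, g):
    """Plain nested-loop convolution, used only to check the packed result."""
    out = [0] * (len(f) + len(g) - 1)
    for n, fv in enumerate(f):
        for k, gv in enumerate(g):
            out[n + k] += fv * gv
    return out


if __name__ == "__main__":
    f = [3, 15, 7]   # 4-bit unsigned inputs
    g = [2, 9]       # 4-bit unsigned kernel
    print(conv1d_packed(f, g, 4, 4))   # [6, 57, 149, 63]
    print(conv1d_reference(f, g))      # [6, 57, 149, 63]
```

In this example the slot width is 4 + 4 + 1 = 9 bits, so the four outputs occupy 36 bits of the product. On real hardware the packed operands and the product must fit the multiplier's width, which is what bounds how many operations a single instruction or DSP can deliver at a given bitwidth.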
