Quantization of foundational models (FMs) is significantly more challenging than that of traditional DNNs due to the emergence of large-magnitude features called outliers. Existing outlier-aware algorithm/architecture co-design techniques either use mixed precision, retaining outliers at high precision but compromising hardware efficiency, or quantize inliers and outliers at the same precision, improving hardware efficiency at the cost of accuracy. To resolve this mutual exclusivity, in this paper we propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization. MicroScopiQ retains outliers at higher precision while pruning a fraction of the least important weights to accommodate the additional outlier bits, ensuring high accuracy, aligned memory, and hardware efficiency. We design a high-throughput, low-overhead accelerator architecture composed of simple multi-precision INT processing elements and a novel network-on-chip called ReCoN that efficiently abstracts away the complexity of supporting high-precision outliers. Additionally, unlike existing alternatives, MicroScopiQ makes no assumption about the locality of outlier weights, enabling applicability to a broad range of FMs. Extensive experiments across various quantization settings show that MicroScopiQ achieves state-of-the-art quantization performance while simultaneously improving inference performance by 3x and reducing energy by 2x over existing alternatives.
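The core idea of trading pruned inlier weights for extra outlier bits can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's actual algorithm (which co-designs the quantizer with the accelerator and ReCoN): it keeps the largest-magnitude fraction of weights at double the inlier precision and prunes an equal count of the smallest-magnitude weights so the overall bit budget stays balanced. The function name, the symmetric uniform quantizer, and the 2x-precision choice for outliers are all illustrative.

```python
import numpy as np

def outlier_aware_quantize(w, bits=4, outlier_frac=0.01):
    """Toy sketch of pruning-complemented outlier-aware quantization.

    Keeps the top `outlier_frac` of weights by magnitude at 2*bits
    precision, prunes an equal number of the smallest-magnitude weights
    to zero (freeing their bits), and quantizes the rest at `bits`.
    """
    w = np.asarray(w, dtype=np.float64)
    n = w.size
    k = max(1, int(round(outlier_frac * n)))
    order = np.argsort(np.abs(w).ravel())
    prune_idx = order[:k]        # least important weights: pruned to 0
    outlier_idx = order[-k:]     # largest-magnitude weights: outliers

    q = w.ravel().copy()
    q[prune_idx] = 0.0           # pruning offsets the extra outlier bits

    # Inliers: symmetric uniform quantization at `bits` precision.
    inlier_mask = np.ones(n, dtype=bool)
    inlier_mask[prune_idx] = False
    inlier_mask[outlier_idx] = False
    scale = np.abs(q[inlier_mask]).max() / (2 ** (bits - 1) - 1) + 1e-12
    q[inlier_mask] = np.round(q[inlier_mask] / scale) * scale

    # Outliers: kept at twice the precision (more quantization levels).
    oscale = np.abs(q[outlier_idx]).max() / (2 ** (2 * bits - 1) - 1) + 1e-12
    q[outlier_idx] = np.round(q[outlier_idx] / oscale) * oscale
    return q.reshape(w.shape)
```

Because exactly as many weights are pruned as are promoted to outliers, the average bits per weight stays at the inlier precision, which is what enables the aligned memory layout the abstract refers to.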