In this study, we address the challenge of low-rank model compression in the context of in-memory computing (IMC) architectures. Traditional pruning approaches, while effective at reducing model size, necessitate additional peripheral circuitry to manage complex dataflows and mitigate dislocation issues, leading to increased area and energy overheads. To circumvent these drawbacks, we propose leveraging low-rank compression techniques, which, unlike pruning, streamline the dataflow and integrate seamlessly with IMC architectures. However, low-rank compression presents its own set of challenges, namely i) suboptimal IMC array utilization and ii) compromised accuracy. To address these issues, we introduce a novel approach that i) employs the shift and duplicate kernel (SDK) mapping technique, which exploits idle IMC columns for parallel processing, and ii) introduces group low-rank convolution, which mitigates the information imbalance in the decomposed matrices. Our experimental results demonstrate that the proposed method achieves up to a 2.5x speedup or a +20.9% accuracy boost over existing pruning techniques.
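To make the two compression styles concrete, the sketch below contrasts a plain low-rank factorization of a (flattened) convolution weight matrix with a grouped variant, in which channel groups are factorized independently so that no single pair of factors must capture the entire spectrum. This is a minimal illustration under assumed shapes and an SVD-based factorization, not the paper's actual method; the function names and the choice of rank/group parameters are hypothetical.

```python
import numpy as np

def low_rank_factor(W, rank):
    """Approximate a 2-D weight matrix W (out x in) as A @ B with the
    given rank, via a truncated SVD (a standard low-rank sketch)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, in)
    return A, B

def group_low_rank_factor(W, rank, groups):
    """Hypothetical grouped variant: split the output channels into
    `groups` blocks and factorize each block independently, so the
    information is balanced across per-group factor pairs."""
    chunks = np.array_split(W, groups, axis=0)
    return [low_rank_factor(chunk, rank) for chunk in chunks]

# Example: a 3x3 conv layer with 64 output and 32 input channels,
# flattened to a (64, 32*3*3) matrix as is common for IMC mapping.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32 * 3 * 3))

factors = group_low_rank_factor(W, rank=8, groups=4)
approx = np.vstack([A @ B for A, B in factors])
print(approx.shape)  # same shape as W: (64, 288)
```

Each per-group factor pair here is small and dense, which is what makes low-rank (unlike unstructured pruning) map onto IMC arrays without extra dataflow-management circuitry.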