In deep learning inference, model parameters are pruned and quantized to reduce model size. Compression methods and common subexpression (CSE) elimination algorithms are then applied to the resulting sparse constant matrices so that the models can be deployed on low-cost embedded devices. However, state-of-the-art CSE elimination methods do not scale to large matrices: they take hours to extract CSEs from a $200 \times 200$ matrix, and their matrix multiplication algorithms run longer than conventional matrix multiplication methods. Moreover, no existing compression method for matrices exploits CSEs. To address these problems, this paper proposes a random search-based algorithm that extracts CSEs from the column pairs of a constant matrix. It produces an adder tree for a $1000 \times 1000$ matrix in a minute. To compress the adder tree, this paper presents a compression format that extends the Compressed Sparse Row (CSR) format to include CSEs. The format achieves compression rates of more than $50\%$ compared to the original CSR format, and simulations for a single-core embedded system show that the matrix multiplication execution time can be reduced by $20\%$.
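To make the idea of column-pair CSEs concrete, the following is a minimal sketch and not the algorithm or format proposed in the paper. It assumes a binary constant matrix stored as one set of nonzero column indices per row. Random column pairs are sampled, and a pair that co-occurs in at least two rows is replaced by a new virtual column representing the sum of the two inputs. The function names (`extract_pair_cses`, `to_csr`, `matvec`), the `trials` parameter, and the way CSEs are appended as extra columns to a CSR-like index array are all illustrative assumptions.

```python
import random


def extract_pair_cses(rows, n_cols, trials=1000, seed=0):
    """Randomly search column pairs shared by at least two rows.

    rows: list of sets holding the column indices of the 1-entries
    (binary matrix assumed). Returns (new_rows, cses), where cses[k] = (a, b)
    means virtual column n_cols + k stands for column a + column b.
    """
    rng = random.Random(seed)
    rows = [set(r) for r in rows]  # work on a copy
    cses = []
    for _ in range(trials):
        total = n_cols + len(cses)
        a, b = rng.sample(range(total), 2)
        hits = [r for r in rows if a in r and b in r]
        if len(hits) < 2:
            continue  # not shared by enough rows, nothing to reuse
        new_col = total
        cses.append((a, b))
        for r in hits:
            # Replace the two entries by one reference to the shared sum.
            r.discard(a)
            r.discard(b)
            r.add(new_col)
    return rows, cses


def to_csr(rows):
    """Build CSR-style index arrays (values are implicit 1s)."""
    indptr, indices = [0], []
    for r in rows:
        indices.extend(sorted(r))
        indptr.append(len(indices))
    return indptr, indices


def matvec(indptr, indices, cses, x):
    """Compute y = A x, evaluating each shared sum exactly once."""
    vals = list(x)
    for a, b in cses:
        # a and b always refer to earlier entries, so order is safe.
        vals.append(vals[a] + vals[b])
    return [
        sum(vals[j] for j in indices[indptr[i]:indptr[i + 1]])
        for i in range(len(indptr) - 1)
    ]


if __name__ == "__main__":
    rows = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2, 3}, {2}]
    x = [1, 2, 3, 4]
    new_rows, cses = extract_pair_cses(rows, n_cols=4, trials=200, seed=1)
    indptr, indices = to_csr(new_rows)
    dense = [sum(x[j] for j in r) for r in rows]
    assert matvec(indptr, indices, cses, x) == dense
    print("stored indices:", len(indices), "shared sums:", len(cses))
```

Each accepted pair removes two indices per affected row and adds one, so the index array never grows. Whether the total storage and addition count shrink depends on how many rows share each pair, which is the trade-off a real extraction algorithm would have to weigh.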