As speech processing systems become more common on mobile and edge devices, the demand for non-intrusive speech quality monitoring is growing. Deep learning methods provide high-quality estimates of objective and subjective speech quality metrics, but their computational requirements are often prohibitive on resource-constrained devices. To address this issue, we investigate binary activation maps (BAMs) for speech quality prediction in a convolutional architecture based on DNSMOS. We show that a binary activation model trained with quantization-aware training matches the predictive performance of the baseline model and remains compatible with other compression techniques. Combined with 8-bit weight quantization, our approach yields a 25-fold memory reduction during inference while replacing almost all dot products with summations. Our findings point toward substantial resource savings if mixed-precision binary multiplication is supported in hardware and software.
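The claim that binary activations replace dot products with summations can be illustrated with a minimal sketch. It assumes activations are binarized to {0, 1} with a hard threshold at zero, which the abstract does not specify; the function names, the threshold, and the randomly drawn int8-range weights are hypothetical and are not taken from the authors' implementation.

```python
import random


def binarize(x, threshold=0.0):
    """Map real-valued activations to {0, 1} (assumed scheme, hard threshold)."""
    return [1 if v > threshold else 0 for v in x]


def dot_product(weights, activations):
    """Reference computation: ordinary multiply-accumulate."""
    return sum(w * a for w, a in zip(weights, activations))


def masked_sum(weights, binary_activations):
    """Same result without multiplications: add only the weights whose activation is 1."""
    return sum(w for w, b in zip(weights, binary_activations) if b)


random.seed(0)
# Hypothetical 8-bit weights and pre-activation values for one output unit.
weights = [random.randint(-128, 127) for _ in range(16)]
pre_activations = [random.uniform(-1.0, 1.0) for _ in range(16)]
bits = binarize(pre_activations)

# With {0, 1} activations, the dot product and the masked sum agree exactly.
assert dot_product(weights, bits) == masked_sum(weights, bits)
```

Under this assumption each activation needs a single bit, and each output needs only integer additions over the selected 8-bit weights, which is the kind of mixed-precision operation the abstract suggests hardware and software should support.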