Quantization has become essential for the efficient deployment of speech processing systems. Although quantization has been widely studied, most existing methods were developed for vision and NLP architectures, and the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, which leads to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it using a two-step local-global scheme driven by an evolution strategy. ESC preserves performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with PTQ methods further reduces performance loss, with only a 1% relative accuracy degradation on the AST model.
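To make the calibration problem concrete, the following is a minimal sketch of why a wide activation range hurts standard max-based calibration, and of how the scale can instead be treated as an optimization variable searched by a simple evolution strategy. It is not the ESC algorithm: the two-step local-global scheme is not reproduced here, and the synthetic heavy-tailed data, the mean-squared-error objective, the function names (`quantize`, `mse`, `es_calibrate`), and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np


def quantize(x, scale, bits=4):
    """Symmetric uniform fake-quantization: round to the integer grid, clip, rescale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale


def mse(x, scale, bits=4):
    """Mean squared reconstruction error for a given scale (assumed objective)."""
    return float(np.mean((x - quantize(x, scale, bits)) ** 2))


def es_calibrate(x, bits=4, generations=50, offspring=16, sigma=0.3, seed=0):
    """Elitist (1+lambda) evolution strategy over a single scalar scale.

    Starts from the max-calibration scale and only accepts improvements,
    so the returned error is never worse than the starting point.
    """
    rng = np.random.default_rng(seed)
    qmax = 2 ** (bits - 1) - 1
    best = float(np.max(np.abs(x))) / qmax
    best_err = mse(x, best, bits)
    for _ in range(generations):
        # Log-normal mutation keeps every candidate scale strictly positive.
        candidates = best * np.exp(sigma * rng.standard_normal(offspring))
        errs = [mse(x, float(c), bits) for c in candidates]
        i = int(np.argmin(errs))
        if errs[i] < best_err:
            best, best_err = float(candidates[i]), errs[i]
        else:
            sigma *= 0.8  # shrink the search radius when no offspring improves
    return best, best_err


# Synthetic heavy-tailed "activations": a few outliers stretch the range.
data_rng = np.random.default_rng(0)
acts = data_rng.standard_t(df=3, size=20_000)

minmax_scale = float(np.max(np.abs(acts))) / (2 ** (4 - 1) - 1)
minmax_err = mse(acts, minmax_scale)
es_scale, es_err = es_calibrate(acts)

print(f"max calibration: scale={minmax_scale:.4f}  mse={minmax_err:.5f}")
print(f"ES search:       scale={es_scale:.4f}  mse={es_err:.5f}")
```

With max calibration, the outliers dictate the step size and most of the 16 INT4 levels go unused by the bulk of the distribution. The search trades some clipping of outliers for finer resolution where most values lie. A per-tensor scalar scale and an MSE objective are the simplest possible choices here; the objective and search space used by ESC itself may differ.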