Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. To address these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high-performance model is optimized for hardware execution by co-designing model compression with the target application, reducing model size by 93.9\% through the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch-normalization-based transformers. Additionally, we adopt softmax-free attention, complemented by an extra batch normalization, enabling a simpler hardware design. The tailored hardware accommodates these diverse computing patterns by decomposing them into element-wise multiply-accumulate (MAC) operations, executed on a 1-D processing array with configurable SRAM addressing, which minimizes hardware complexity and simplifies zero skipping. Implemented in the TSMC 40nm CMOS process, the final design requires only 207.8K gates and 53.75KB of SRAM, and consumes just 8.08 mW for real-time inference at a 62.5MHz clock frequency.
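To illustrate the softmax-free attention with an extra batch normalization mentioned above, the following is a minimal sketch, not the paper's exact formulation (which the abstract does not specify): it assumes a ReLU feature map in place of softmax and applies a `BatchNorm1d` to the attention output, so that the whole block reduces to plain multiply-accumulate operations.

```python
import torch
import torch.nn as nn

class SoftmaxFreeAttention(nn.Module):
    """Hypothetical single-head softmax-free attention sketch.

    Assumptions (not from the abstract): softmax is replaced with a
    ReLU-activated score matrix, and the extra batch normalization is a
    BatchNorm1d applied over the channel dimension of the output.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Extra BN standing in for softmax's normalization effect
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # ReLU scores instead of softmax: only MACs and a max(0, .)
        scores = torch.relu(q @ k.transpose(-2, -1)) / x.shape[-1] ** 0.5
        out = scores @ v
        # BatchNorm1d expects (batch, channels, seq_len)
        return self.bn(out.transpose(1, 2)).transpose(1, 2)

x = torch.randn(2, 10, 16)       # (batch, frames, features)
y = SoftmaxFreeAttention(16)(x)  # same shape as input
```

Removing the exponential and the row-wise division of softmax is what lets the computation be mapped onto a uniform MAC datapath; the BN is a per-channel affine scale at inference time, which folds into adjacent layers.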