This work introduces the Cleanformer, a streaming multichannel neural-based enhancement frontend for automatic speech recognition (ASR). The model has a conformer-based architecture that takes as inputs one channel each of the raw and enhanced signals, and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise cancellation algorithm known as Speech Cleaner, which makes use of noise context to derive its filter taps. The time-frequency mask is applied to the noisy input to produce enhanced output features for ASR. Detailed evaluations on simulated and re-recorded datasets, in both speech-based and non-speech-based noise, show significant reductions in word error rate (WER) when using a large-scale state-of-the-art ASR model. The model is also shown to significantly outperform enhancement using a beamformer with ideal steering. The enhancement model is agnostic to the number of microphones and the array configuration and can therefore be used with different microphone arrays without retraining. Performance is demonstrated to improve with more microphones, up to 4, with each additional microphone providing a smaller marginal benefit. Specifically, at an SNR of -6 dB, relative WER improvements of about 80% are shown in both noise conditions.
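The masking step and the relative WER metric mentioned above can be illustrated with a minimal sketch. This is not the Cleanformer implementation: the abstract does not state the feature domain or the range of the mask, so the sketch assumes linear-magnitude features and a mask bounded to [0, 1]. The function names `apply_tf_mask` and `relative_wer_improvement` are hypothetical, and the numbers in the usage lines are made up purely to show the arithmetic.

```python
import numpy as np


def apply_tf_mask(noisy_features, mask):
    """Apply a time-frequency mask element-wise to noisy features.

    Assumption (not from the abstract): features are linear magnitudes of
    shape (frames, frequency_bins) and the mask is bounded to [0, 1].
    """
    noisy_features = np.asarray(noisy_features, dtype=float)
    # Clip the mask so it can only attenuate, never amplify (assumed behaviour).
    mask = np.clip(np.asarray(mask, dtype=float), 0.0, 1.0)
    if noisy_features.shape != mask.shape:
        raise ValueError("mask and features must have the same shape")
    return noisy_features * mask


def relative_wer_improvement(baseline_wer, enhanced_wer):
    """Relative WER improvement in percent: 100 * (baseline - enhanced) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - enhanced_wer) / baseline_wer


# Illustrative values only; these are not results from the paper.
features = np.array([[2.0, 4.0], [1.0, 3.0]])
mask = np.array([[0.5, 1.0], [0.0, 0.9]])
enhanced = apply_tf_mask(features, mask)
improvement = relative_wer_improvement(baseline_wer=50.0, enhanced_wer=10.0)
```

Under this definition, a relative improvement of about 80% means the enhanced WER is roughly one fifth of the baseline WER, whatever the absolute values are.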