Revisiting Acoustic Features for Robust ASR

Automatic Speech Recognition (ASR) systems must be robust to the many types of noise present in real-world environments, including environmental noise, room impulse responses, special effects, and attacks by malicious actors (adversarial attacks). Recent works seek to improve accuracy and robustness by developing novel Deep Neural Networks (DNNs) and curating diverse training datasets for them, while using relatively simple acoustic features. While this approach improves robustness to the types of noise present in the training data, it confers limited robustness against unseen noises and negligible robustness to adversarial attacks. In this paper, we revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception to perform accurate and robust ASR. Specifically, we evaluate the ASR accuracy and robustness of several biologically inspired acoustic features. In addition to several features from prior works, such as gammatone filterbank features (GammSpec), we also propose two new acoustic features, frequency masked spectrogram (FreqMask) and difference of gammatones spectrogram (DoGSpec), to simulate the neuro-psychological phenomena of frequency masking and lateral suppression. Experiments on diverse models and datasets show that (1) DoGSpec achieves significantly better robustness than the highly popular log mel spectrogram (LogMelSpec) with minimal accuracy degradation, and (2) GammSpec achieves better accuracy and robustness to non-adversarial noises from the Speech Robust Bench benchmark, but is outperformed by DoGSpec against adversarial attacks.
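For readers unfamiliar with gammatone filterbank features, the following is a minimal sketch of how a log-energy gammatone spectrogram could be computed with NumPy. It is not the authors' implementation of GammSpec: the function names, the filter count, the 50 ms kernel length, the frame and hop sizes, and the use of the Glasberg and Moore ERB formulas are all assumptions chosen for illustration.

```python
import numpy as np

def erb_space(f_min, f_max, n_filters):
    """Center frequencies equally spaced on the ERB-rate scale
    (Glasberg & Moore approximation; an assumed choice)."""
    to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return from_erb(np.linspace(to_erb(f_min), to_erb(f_max), n_filters))

def gammatone_kernels(center_freqs, sr, duration=0.05, order=4):
    """Truncated gammatone impulse responses, one row per center frequency:
    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * sr)) / sr
    erb = 24.7 * (4.37 * center_freqs / 1000.0 + 1.0)  # bandwidth in Hz
    b = 1.019 * erb
    envelope = t[None, :] ** (order - 1) * np.exp(-2 * np.pi * b[:, None] * t[None, :])
    kernels = envelope * np.cos(2 * np.pi * center_freqs[:, None] * t[None, :])
    # Scale each kernel to unit L2 norm so channels are comparable.
    return kernels / np.linalg.norm(kernels, axis=1, keepdims=True)

def gammatone_spectrogram(signal, sr, n_filters=32, f_min=50.0, f_max=None,
                          frame_len=400, hop=160):
    """Log frame energy of each gammatone channel; shape (n_filters, n_frames)."""
    if f_max is None:
        f_max = 0.9 * sr / 2  # stay below Nyquist (hypothetical default)
    cfs = erb_space(f_min, f_max, n_filters)
    kernels = gammatone_kernels(cfs, sr)
    # Filter the waveform with every channel (direct convolution, for clarity).
    filtered = np.stack([np.convolve(signal, k, mode="same") for k in kernels])
    n_frames = 1 + (filtered.shape[1] - frame_len) // hop
    energy = np.stack(
        [(filtered[:, i * hop:i * hop + frame_len] ** 2).mean(axis=1)
         for i in range(n_frames)],
        axis=1,
    )
    return np.log(energy + 1e-10), cfs
```

Compared with a mel filterbank applied to an STFT, this sketch filters the waveform directly with ERB-spaced channels whose bandwidths widen with frequency, which is the property that motivates gammatone features as a model of cochlear filtering. Direct convolution is used only for readability; a practical front end would likely use FFT-based or recursive filtering.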
