Most automatic speech processing systems perform worse when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (pyannote.audio); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.
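For readers unfamiliar with the two regression targets, the following is a minimal sketch of their standard textbook definitions: the speech-to-noise ratio as a power ratio in dB, and C50 as the ratio of the energy in the first 50 ms of a room impulse response to the energy arriving afterwards, also in dB. It is not Brouhaha's code, and it is not a claim about how the training labels were produced. The function names `snr_db` and `c50_db` are hypothetical, and the sketch assumes the clean speech, the noise, and an impulse response that starts at the direct sound are all available separately, which is only the case for synthesized data.

```python
import math


def snr_db(speech, noise):
    """Speech-to-noise ratio in dB from separate speech and noise samples.

    Assumes both sequences are non-empty and the noise is not all zeros.
    """
    p_speech = sum(s * s for s in speech) / len(speech)  # mean speech power
    p_noise = sum(n * n for n in noise) / len(noise)     # mean noise power
    return 10.0 * math.log10(p_speech / p_noise)


def c50_db(impulse_response, sample_rate):
    """Clarity index C50 in dB: early (first 50 ms) over late energy.

    Assumes the impulse response begins at the direct sound (index 0)
    and is longer than 50 ms, so that the late energy is non-zero.
    """
    split = int(round(0.050 * sample_rate))  # sample index at 50 ms
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    return 10.0 * math.log10(early / late)


# Toy signals: alternating-sign sequences with fixed amplitudes.
speech = [1.0, -1.0] * 100
noise = [0.1, -0.1] * 100
toy_snr = snr_db(speech, noise)

# Toy impulse response at 16 kHz: strong early part, weak late tail.
toy_ir = [1.0] * 800 + [0.1] * 800
toy_c50 = c50_db(toy_ir, 16000)
```

Higher values of either quantity indicate cleaner conditions: a larger speech-to-noise ratio means less noise relative to speech, and a larger C50 means more of the impulse response energy arrives early, that is, less reverberation. A model such as Brouhaha has to estimate these quantities from the mixed single-channel recording alone, where the separate components used above are not observable.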