Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining that performance on overlapping multi-speaker speech remains challenging. Recent research has shown that the encoder of an ASR model captures different levels of information at different layers: the lower layers tend to carry more acoustic information, and the upper layers more linguistic information. This inspires us to develop a Sidecar separator that equips a well-trained ASR model for multi-speaker scenarios by separating the mixed speech embedding between two suitable layers. We experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By freezing the parameters of the original model and training only the Sidecar (8.7 M parameters, 8.4% of the total), the proposed approach outperforms the previous state of the art by a large margin on the two-speaker LibriMix dataset, reaching a word error rate (WER) of 10.36%, and obtains comparable results (7.56% WER) on the LibriSpeechMix dataset with limited training.
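As a quick sanity check on the parameter budget quoted above, the short sketch below derives the total and frozen parameter counts implied by the two figures in the abstract (8.7 M Sidecar parameters, 8.4% of all parameters). The function name `implied_param_counts` and the constants are illustrative and not part of the original work, and because both inputs are rounded, the results are approximate.

```python
# Figures quoted in the abstract (both are rounded values).
SIDECAR_PARAMS_M = 8.7   # trainable Sidecar parameters, in millions
SIDECAR_SHARE = 0.084    # Sidecar share of all parameters (8.4%)


def implied_param_counts(sidecar_m: float, share: float) -> tuple[float, float]:
    """Return (total, frozen) parameter counts in millions implied by the share."""
    total_m = sidecar_m / share      # share = sidecar / total
    frozen_m = total_m - sidecar_m   # everything except the Sidecar is frozen
    return total_m, frozen_m


total_m, frozen_m = implied_param_counts(SIDECAR_PARAMS_M, SIDECAR_SHARE)
print(f"implied total parameters:  ~{total_m:.1f} M")
print(f"implied frozen parameters: ~{frozen_m:.1f} M")
```

Under these rounded inputs, the frozen portion comes out at roughly 95 M parameters, which is in the range commonly reported for a base-size wav2vec 2.0 model; the abstract itself does not state which model size was used.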