Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. Recent research indicates that these two tasks are interdependent and complementary, which motivates us to explore a unified modeling method that addresses both in the context of overlapped speech. A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR) system into a multi-talker one by inserting a Sidecar separator into a frozen, well-trained ASR model. Building on this, we incorporate a diarization branch into the Sidecar, allowing unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters. The proposed method yields better ASR results than the baseline on the LibriMix and LibriSpeechMix datasets. Moreover, without sophisticated customization for the diarization task, our method achieves acceptable diarization results on the two-speaker subset of CALLHOME with only a few adaptation steps.