Transcribing the speech of multiple overlapping speakers typically requires separating the audio into multiple streams and recognizing each one independently. More recent work jointly separates and transcribes, but requires a separate decoding component for each speaker. We propose the TOGGL model to simultaneously transcribe the speech of multiple speakers. The TOGGL model uses special output tokens to attribute the speech to each speaker with only a single decoder. Our approach generalizes beyond two speakers, even when trained only on two-speaker data. We demonstrate superior performance compared to competing approaches on a conversational speech dataset. Our approach also improves performance on single-speaker audio.
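The core mechanism described above can be illustrated with a minimal sketch. This is an assumption about how such speaker-attribution tokens could work, not the paper's actual token scheme: a single decoder emits one token stream in which hypothetical special tokens like `<spk1>` and `<spk2>` mark which speaker the following words belong to, and a post-processing step recovers per-speaker transcripts.

```python
def split_by_speaker(tokens):
    """Group a single decoded token stream into per-speaker transcripts.

    Tokens of the form <spkN> (a hypothetical marker format) switch
    attribution; subsequent word tokens are assigned to that speaker.
    """
    transcripts = {}
    current = None
    for tok in tokens:
        if tok.startswith("<spk") and tok.endswith(">"):
            current = tok  # switch attribution to this speaker
            transcripts.setdefault(current, [])
        elif current is not None:
            transcripts[current].append(tok)
    return {spk: " ".join(words) for spk, words in transcripts.items()}

# One decoder output covering two overlapping speakers:
decoded = ["<spk1>", "hello", "there", "<spk2>", "hi", "<spk1>", "how", "are", "you"]
print(split_by_speaker(decoded))
# {'<spk1>': 'hello there how are you', '<spk2>': 'hi'}
```

Because speaker identity is carried in the output vocabulary rather than in separate decoders, nothing in this scheme is tied to a fixed speaker count, which is consistent with the claim that the approach generalizes beyond two speakers.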