Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances and may not generalize to real-life scenarios in which the audio contains conversations among multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end, multi-task model named Speaker-Turn Aware Conversational Speech Translation, which combines automatic speech recognition, speech translation, and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapt by merging the two single-speaker channels into one multi-speaker channel, yielding a more realistic and challenging scenario with multiple speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions show that our model outperforms conventional ST reference systems on the multi-speaker condition while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.
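As an illustration of the two data-preparation ideas mentioned above (merging two single-speaker channels into one, and serializing labels with special tokens), the following is a minimal sketch, not the released scripts. The token strings `[TURN]` and `[XT]`, the `Segment` type, the sample-wise averaging used for mixing, and the overlap rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical token names; the abstract only says "special tokens".
TURN = "[TURN]"  # marks a change of speaker
XT = "[XT]"      # marks cross-talk (overlapping speech)


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def mix_channels(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Merge two mono channels (same sample rate) into one by averaging.

    The shorter channel is padded with silence. Averaging is an assumed
    mixing strategy chosen here to avoid clipping.
    """
    n = max(len(left), len(right))
    a = list(left) + [0.0] * (n - len(left))
    b = list(right) + [0.0] * (n - len(right))
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def serialize(segments: Sequence[Segment]) -> str:
    """Build one label string from time-ordered segments.

    A speaker change inserts XT if the new segment starts before the
    latest end time seen so far, otherwise TURN.
    """
    tokens: List[str] = []
    prev_speaker = None
    latest_end = 0.0
    for seg in sorted(segments, key=lambda s: s.start):
        if prev_speaker is not None and seg.speaker != prev_speaker:
            tokens.append(XT if seg.start < latest_end else TURN)
        tokens.append(seg.text)
        prev_speaker = seg.speaker
        latest_end = max(latest_end, seg.end)
    return " ".join(tokens)


if __name__ == "__main__":
    demo = [
        Segment("A", 0.0, 2.0, "hola"),
        Segment("B", 2.5, 4.0, "hello"),
        Segment("A", 3.5, 5.0, "si"),
    ]
    print(serialize(demo))
```

In this sketch the same serialized string could serve as the target for either the recognition or the translation task, with `text` holding the transcript or its translation respectively; how the actual model distinguishes the tasks is not specified in the abstract.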