End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges, such as speaker diarization (SD) without accurate word timestamps and handling overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, both of which were originally developed for multi-talker speech recognition. Because no evaluation benchmark exists in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure ST quality while taking SD accuracy into account. Our system achieves strong ST and SD capability compared with offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.