End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges, such as speaker diarization (SD) without accurate word timestamps and the handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, both of which were originally developed for multi-talker speech recognition. Because no evaluation benchmark exists in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose two new metrics, speaker-agnostic BLEU and speaker-attributed BLEU, to measure ST quality while taking SD accuracy into account. Our system achieves strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.