ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST). Each task is supported with a wide variety of approaches, which differentiates ESPnet-ST-v2 from other open-source spoken language translation toolkits. The toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.