We present synchronous bilingual Connectionist Temporal Classification (CTC), a framework that uses dual CTC to bridge both the modality gap and the language gap in the speech translation (ST) task. By taking the transcript and the translation as concurrent CTC objectives, the model narrows the gap between audio and text as well as between the source and target languages. Building on recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performance on the MuST-C ST benchmarks under resource-constrained scenarios. Our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at https://github.com/xuchennlp/S2T.
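To make the idea of using the transcript and the translation as concurrent CTC objectives concrete, the following is a minimal sketch and not the released implementation. It assumes two hypothetical sets of frame-level log-probabilities, one over a source vocabulary and one over a target vocabulary, and combines the two CTC negative log-likelihoods with interpolation weights. The function names `ctc_nll` and `bilingual_ctc_loss`, the weights `w_src` and `w_tgt`, and the choice of blank index 0 are illustrative assumptions; the abstract does not specify where in the network each CTC is applied or how the losses are weighted.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a few scalars."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the standard forward algorithm.

    log_probs: list of T frames, each a list of log-probabilities over the vocabulary.
    target:    list of label ids (no blanks).
    """
    # Extended label sequence: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip the blank between distinct labels
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def bilingual_ctc_loss(src_log_probs, tgt_log_probs, transcript, translation,
                       w_src=0.5, w_tgt=0.5):
    """Weighted sum of two CTC losses (weights are an assumption, not from the paper)."""
    return (w_src * ctc_nll(src_log_probs, transcript)
            + w_tgt * ctc_nll(tgt_log_probs, translation))
```

As a sanity check under these assumptions, with two frames of uniform probabilities over a three-symbol vocabulary, `ctc_nll` on the single-label target `[1]` sums three valid alignments of probability 1/9 each, giving a loss of log 3.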