Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 target languages with varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that this matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
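To make the retrieval-based adaptation concrete, the following is a minimal sketch of the standard kNN-MT formulation: decoder states are stored as keys with their target tokens as values, the nearest neighbours of the current state are turned into a token distribution, and that distribution is interpolated with the model's own prediction. The toy datastore, the two-dimensional states, and the values of `k`, `temperature` and `lam` are illustrative assumptions, not the settings of the system described here.

```python
import math
from collections import defaultdict


def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens."""
    # Squared L2 distance between the query state and every stored key.
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    ]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    # Softmax over negative distances; neighbours sharing a token pool their mass.
    weights = defaultdict(float)
    for dist, token in nearest:
        weights[token] += math.exp(-dist / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_mt, p_knn, lam=0.5):
    """Mix the retrieval distribution with the translation model's distribution."""
    tokens = set(p_mt) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_mt.get(t, 0.0)
        for t in tokens
    }


# Hypothetical datastore: (decoder state, target token) pairs from in-domain text.
datastore = [
    ((0.0, 0.0), "encoder"),
    ((0.1, 0.0), "encoder"),
    ((5.0, 5.0), "coder"),
]
query = (0.0, 0.1)                       # decoder state at the current step
p_mt = {"encoder": 0.3, "coder": 0.7}    # unadapted model prefers the wrong term

p_knn = knn_distribution(query, datastore, k=2)
p_final = interpolate(p_mt, p_knn, lam=0.5)
print(max(p_final, key=p_final.get))
```

In this toy case both retrieved neighbours vote for the in-domain term, so the interpolated distribution overrides the unadapted model's preference without any parameter update. A real datastore holds millions of entries and needs an approximate nearest-neighbour index rather than the exhaustive scan used here.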