Speech-to-Text Translation (S2TT) has typically been addressed with cascade systems, in which a speech recognition system generates a transcription that is subsequently passed to a translation model. While there has been growing interest in developing direct speech translation systems to avoid propagating errors and losing non-verbal content, prior work in direct S2TT has struggled to conclusively establish the advantages of integrating the acoustic signal directly into the translation process. This work proposes using contrastive evaluation to quantitatively measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Specifically, we evaluate Korean-English translation systems on a test set containing wh-phrases, for which prosodic features are necessary to produce translations with the correct intent, whether that is a statement, a yes/no question, a wh-question, or another type. Our results demonstrate the value of direct translation systems over cascade translation models, with a 12.9% improvement in overall accuracy in ambiguous cases and up to a 15.6% increase in F1 score for one of the major intent categories. To the best of our knowledge, this work is the first to provide quantitative evidence that direct S2TT models can effectively leverage prosody. The code for our evaluation is openly available.
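The contrastive evaluation described in the abstract can be illustrated with a minimal sketch. It assumes each test item pairs one utterance with several candidate translations, one per intent, and that a system exposes a function returning a score (for example a log-probability) for a candidate given the utterance. The names `score`, `contrastive_predict`, `evaluate`, and the intent labels are hypothetical and chosen for illustration; this is not the authors' released code.

```python
from typing import Callable, Dict, List, Tuple

# One test item: (utterance id, candidate translations keyed by intent, gold intent).
# The structure is an assumption made for this sketch.
Example = Tuple[str, Dict[str, str], str]
# Hypothetical scorer: higher means the system prefers this translation.
Scorer = Callable[[str, str], float]


def contrastive_predict(score: Scorer, utterance: str, candidates: Dict[str, str]) -> str:
    """Return the intent whose candidate translation the system scores highest."""
    return max(candidates, key=lambda intent: score(utterance, candidates[intent]))


def evaluate(score: Scorer, examples: List[Example]) -> Tuple[float, Dict[str, float]]:
    """Compute overall accuracy and per-intent F1 from contrastive predictions."""
    gold = [g for _, _, g in examples]
    pred = [contrastive_predict(score, u, c) for u, c, _ in examples]

    accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)

    f1: Dict[str, float] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(p == label and g == label for p, g in zip(pred, gold))
        fp = sum(p == label and g != label for p, g in zip(pred, gold))
        fn = sum(p != label and g == label for p, g in zip(pred, gold))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        f1[label] = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

Under this setup, a cascade system that sees only the transcript would assign the same scores to a given candidate regardless of prosody, whereas a direct system can in principle vary its preference with the audio. Running `evaluate` on both systems over the same items would then yield comparable accuracy and F1 figures.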