The prosody of a spoken utterance, including features like stress, intonation, and rhythm, can significantly affect its underlying semantics and, as a consequence, its textual translation. Nevertheless, prosody is rarely studied in the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but it remains unclear whether this advantage holds in practice. A major obstacle is the difficulty of evaluating prosody awareness in translation. To address this, we introduce an evaluation methodology and a focused benchmark, ContraProST, aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models possess some internal representation of prosody, but the prosodic signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage in this regard, and (c) certain cascaded systems also capture prosodic information in the translation, though to a lesser extent and depending on the particulars of the transcript's surface form.