Speech-to-speech translation systems today do not adequately support dialog. In particular, nuances of speaker intent and stance can be lost when prosody is transferred improperly. We present an exploration of what needs to be done to overcome this. First, we developed a data collection protocol in which bilingual speakers re-enact utterances from an earlier conversation in their other language, and used it to collect an English-Spanish corpus, so far comprising 1871 matched utterance pairs. Second, we developed a simple prosodic dissimilarity metric based on Euclidean distance over a broad set of prosodic features. We then used the corpus and the metric to investigate cross-language prosodic differences, measure the likely utility of three simple baseline models, and identify phenomena that will require more powerful modeling. Our findings should inform future research on cross-language prosody and the design of speech-to-speech translation systems capable of effective prosody transfer.
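To make the idea of a Euclidean-distance dissimilarity metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes each utterance has already been reduced to a fixed-length vector of prosodic features, and that features are z-normalized per column before distances are taken; the normalization step, the function names `zscore` and `prosodic_dissimilarity`, and the toy values are all illustrative assumptions rather than details stated above.

```python
import numpy as np


def zscore(features: np.ndarray) -> np.ndarray:
    """Standardize each feature column to zero mean and unit variance.

    Assumption: normalization is applied so that no single feature
    dominates the distance because of its scale.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant columns map to zero instead of dividing by zero
    return (features - mean) / std


def prosodic_dissimilarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise Euclidean distance between two (n_utterances, n_features) arrays.

    Row i of `a` and row i of `b` are assumed to be a matched utterance pair.
    """
    return np.linalg.norm(a - b, axis=1)


# Hypothetical toy data: 3 matched pairs, 2 made-up prosodic features each.
en = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])
es = np.array([[3.0, 4.0], [1.0, 2.0], [0.0, 0.0]])

distances = prosodic_dissimilarity(en, es)  # one distance per matched pair
en_normalized = zscore(en)                  # per-feature standardization
```

In a setup like this, a lower distance for a pair would suggest more similar prosody under the chosen features; what counts as a meaningful difference depends on the actual feature set and normalization used.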