We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a change of speaker and language. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
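To make the word-level evaluation concrete, the following is a minimal sketch of how frame-level emphasis scores might be pooled into word-level decisions and then compared between input and output. It is an illustration under stated assumptions, not the EmphaClass or EmphAssess implementation: the `Word` type, the mean-pooling rule, the 0.5 threshold, the frame shift, and the precision/recall/F1 scoring are all hypothetical choices, and word alignments are assumed to be given.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass
class Word:
    """A word with its time span in seconds (hypothetical alignment format)."""
    text: str
    start: float
    end: float


def word_level_emphasis(
    frame_probs: Sequence[float],
    words: Sequence[Word],
    frame_shift: float = 0.02,  # assumed frame hop in seconds
    threshold: float = 0.5,     # assumed decision threshold
) -> List[bool]:
    """Mean-pool frame-level emphasis probabilities over each word's span."""
    flags = []
    for w in words:
        first = round(w.start / frame_shift)
        last = max(first + 1, round(w.end / frame_shift))
        span = frame_probs[first:last]
        mean = sum(span) / len(span) if span else 0.0
        flags.append(mean >= threshold)
    return flags


def emphasis_f1(expected: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over indices of emphasized words."""
    tp = len(expected & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

In a resynthesis setting, `expected` would hold the indices of words emphasized in the input and `predicted` those detected in the output; for translation, the expected indices would first have to be mapped through a word alignment between source and target, which this sketch does not cover.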