Prompt-based text-to-speech (TTS) aims to generate speech that adheres to fine-grained style cues provided in a text prompt. However, most prior work evaluates prompt adherence with measures that are neither plausible nor faithful: they cannot guarantee that the evaluation agrees with human judgment (plausibility) or that it is grounded in the prompt (faithfulness). We therefore present a new automatic metric, the Style Prompt Adherence Metric (SPAM), which explicitly satisfies both plausibility and faithfulness. Inspired by CLAP, our approach factorizes speech into acoustic attributes and aligns them with the style prompt. We also train the scorer with a supervised contrastive loss, which can provide a clearer distinction between different semantics. We conducted two experiments, one for each perspective. The plausibility experiment showed that SPAM achieves a strong correlation with the mean opinion score (MOS). The faithfulness experiment demonstrated that SPAM is grounded in the given style prompt, as it can discriminate between different semantics of the prompt. We believe that SPAM can provide a viable automatic solution for evaluating the style prompt adherence of synthesized speech.
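To make the supervised contrastive loss mentioned in the abstract concrete, the following is a minimal sketch of the generic form of that loss: embeddings that share a label are pulled together, and all others are pushed apart. It is an illustration under stated assumptions, not the authors' implementation. The function names, the temperature value of 0.1, the toy embeddings, and the use of integer labels to stand for prompt semantics are all hypothetical; the abstract does not specify how the loss is applied to speech and prompt embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch.

    Assumption: `labels` are integer class ids standing in for the
    "semantics" of a style prompt. This is a hypothetical setup, not the
    training recipe described by the authors.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    per_anchor_losses = []
    for i in range(n):
        # Positives: every other sample that shares the anchor's label.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp for the denominator.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of the positives.
        loss_i = -sum(sims[p] - log_denom for p in positives) / len(positives)
        per_anchor_losses.append(loss_i)
    if not per_anchor_losses:
        raise ValueError("batch contains no anchor with a positive pair")
    return sum(per_anchor_losses) / len(per_anchor_losses)


# Toy batch: two tight clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch, labels that match the geometric clusters give a lower loss than labels that cut across them, which is the behavior that encourages a clearer separation between different semantics.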