We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space. It supports a wide range of intrinsic (speaker-level) and situational (utterance-level) style descriptors, such as pitch, texture, and emotion, far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, and find that specialization yields stronger performance on individual style dimensions, while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We evaluate our models on style caption retrieval and speech attribute classification, and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap .
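To make the dual-encoder idea concrete, the following is a minimal sketch of how a shared speech-text embedding space can be trained and then reused as an inference-time reward. It assumes a symmetric InfoNCE-style objective of the kind commonly used in CLIP/CLAP-style models and a best-of-N selection rule based on cosine similarity; the abstract does not specify either, so the function names (`symmetric_contrastive_loss`, `pick_best_candidate`), the temperature value, and the selection rule are illustrative assumptions rather than the released implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def symmetric_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Assumed InfoNCE-style loss over a batch of paired embeddings.

    speech_emb, text_emb: arrays of shape (batch, dim); row i of each
    is treated as a matching (speech, caption) pair.
    The temperature is a hypothetical default, not a value from the paper.
    """
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature  # (batch, batch) cosine similarities
    idx = np.arange(len(s))

    def cross_entropy(lg):
        # Numerically stable log-softmax; the target is the diagonal entry.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the speech-to-text and text-to-speech directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def pick_best_candidate(candidate_speech_embs, caption_emb):
    """Assumed best-of-N reward use: score each synthesized candidate by
    cosine similarity to the style caption and return the top index."""
    scores = l2_normalize(candidate_speech_embs) @ l2_normalize(caption_emb)
    return int(np.argmax(scores)), scores
```

In this sketch, a style-prompted TTS system would generate several candidate utterances, embed each with the speech encoder, and keep the one that `pick_best_candidate` ranks highest against the caption embedding, which requires no further training of the TTS model.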