The ambiguity of human emotions poses several challenges for machine learning models, as emotions often overlap and lack clearly delineated boundaries. Contrastive language-audio pretraining (CLAP) has emerged as a key technique for generalisable emotion recognition. However, because conventional CLAP enforces a strict one-to-one alignment between paired audio-text samples, it overlooks intra-modal similarity and treats all non-matching pairs as equally negative. This conflicts with the fuzzy boundaries between different emotions. To address this limitation, we propose SmoothCLAP, which introduces softened targets derived from intra-modal similarity and paralinguistic features. By combining these softened targets with conventional contrastive supervision, SmoothCLAP learns embeddings that respect graded emotional relationships while retaining the same inference pipeline as CLAP. Experiments on eight affective computing tasks across English and German demonstrate that SmoothCLAP consistently achieves superior performance. Our results highlight that leveraging soft supervision is a promising strategy for building emotion-aware audio-text models.
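The core idea of mixing one-hot contrastive targets with softened targets can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the function name `smooth_clap_targets`, the mixing weight `alpha`, and the temperature `tau` are hypothetical, and the soft targets here are derived only from intra-modal (audio-audio) similarity, omitting the paralinguistic features the abstract also mentions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smooth_clap_targets(audio_emb, text_emb, alpha=0.3, tau=0.07):
    """Blend one-hot CLAP targets with soft targets from intra-modal
    similarity, then compute an audio-to-text cross-entropy loss.

    alpha and tau are illustrative hyperparameter names, not from the paper.
    """
    n = audio_emb.shape[0]
    # L2-normalise both modalities so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    hard = np.eye(n)                    # conventional CLAP: strict one-to-one alignment
    soft = softmax(a @ a.T / tau)       # graded targets from audio-audio similarity
    targets = (1 - alpha) * hard + alpha * soft  # each row still sums to 1

    logits = a @ t.T / tau              # cross-modal audio->text similarities
    loss = -(targets * np.log(softmax(logits))).sum(axis=1).mean()
    return targets, loss
```

Because the softened targets assign non-zero mass to similar-sounding non-matching samples, semantically close emotions (e.g. anger and frustration) are no longer pushed apart as hard negatives, while inference remains unchanged: embeddings are still compared by cosine similarity as in standard CLAP.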