Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to `answer' a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for general audio tasks, like captioning or sound event detection, there are no datasets with matched audio and text queries for computational paralinguistic (CP) tasks. As a result, the community relies on generic CLAP models trained for general audio with limited success. In the present study, we explore training considerations for ParaCLAP, a CLAP-style model suited to CP, including a novel process for creating audio-language queries. We demonstrate its effectiveness on a set of computational paralinguistic tasks, where it is shown to surpass the performance of open-source state-of-the-art models.
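The pretraining objective referred to above can be illustrated with a minimal sketch of the symmetric contrastive loss typically used in CLAP-style models: audio and text embeddings for matched (audio, query) pairs are pulled together, mismatched pairs pushed apart, via cross-entropy over a temperature-scaled cosine-similarity matrix in both retrieval directions. The function name, the fixed temperature of 0.07, and the use of plain NumPy are illustrative assumptions, not the exact ParaCLAP implementation.

```python
import numpy as np

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (audio, text) pairs.

    audio_emb, text_emb: arrays of shape (batch, dim), row i of each is a pair.
    Note: the temperature value and function shape are illustrative assumptions.
    """
    # L2-normalise so that dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; matched pairs lie on the diagonal
    logits = a @ t.T / temperature
    labels = np.arange(len(a))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # symmetric: audio-to-text and text-to-audio retrieval directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly matched orthogonal embeddings the diagonal dominates and the loss approaches zero; permuting the text rows relative to the audio rows drives it up, which is the signal that supervises both encoders during pretraining.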