Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is inherently many-to-many: one audio segment can be described by multiple captions, and one caption can describe multiple audio segments. To address this, we propose Probabilistic Language-Audio Pre-training (ProLAP), which models this multiplicity as the spread of probability distributions in a joint language-audio embedding space. To effectively learn intra-modal hierarchical relationships, we also introduce two objectives: (i) a hierarchical inclusion loss to promote semantic hierarchical understanding of inputs and (ii) a mask repulsive loss to improve learning efficiency when optimizing the hierarchical inclusion loss. With this training strategy, our model can learn the hierarchical structure inherent in the data even from small datasets, in contrast to prior probabilistic approaches that rely on large-scale datasets. In our experiments, ProLAP outperforms existing deterministic approaches on audio-text retrieval tasks. Moreover, through experiments on the audio traversal task introduced in this paper, we demonstrate that ProLAP captures a plausible semantic hierarchy.
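The abstract does not give ProLAP's exact formulation, but the core idea of modeling multiplicity as distributional spread can be illustrated with a minimal sketch. Assuming (hypothetically) that each encoder outputs a diagonal Gaussian (a mean vector and per-dimension standard deviations), a closed-form probabilistic distance such as the squared 2-Wasserstein distance compares both location and spread, so a broad audio distribution can remain close to several narrower caption distributions at once:

```python
import numpy as np

def wasserstein2_diag(mu1, sigma1, mu2, sigma2):
    """Closed-form squared 2-Wasserstein distance between diagonal
    Gaussians N(mu1, diag(sigma1^2)) and N(mu2, diag(sigma2^2))."""
    return float(np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2))

# Hypothetical 2-D embeddings: one audio clip with a wide spread,
# and two distinct captions with narrow spreads.
mu_audio, sig_audio = np.array([0.2, 0.8]), np.array([0.3, 0.3])
mu_cap1,  sig_cap1  = np.array([0.25, 0.75]), np.array([0.1, 0.1])
mu_cap2,  sig_cap2  = np.array([0.1, 0.9]),   np.array([0.1, 0.1])

# The broad audio distribution stays close to both captions, one way a
# probabilistic embedding can express a many-to-many correspondence.
d1 = wasserstein2_diag(mu_audio, sig_audio, mu_cap1, sig_cap1)
d2 = wasserstein2_diag(mu_audio, sig_audio, mu_cap2, sig_cap2)
```

This is only an illustrative stand-in: the loss names, encoder outputs, and distance here are assumptions, not the paper's definitions.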