Contrastive language-audio pretraining, which aims to unify multimodal representations in a shared embedding space, serves as a cornerstone for a wide range of applications, from cross-modal retrieval to cutting-edge multimodal large language models. However, we find that the perpendicular component of the pushing force exerted by negative samples in contrastive learning is a double-edged sword: it carries rich supplementary information from the negatives, yet its unconstrained nature causes optimization trajectory drift and training instability. To address this, we propose Support Vector Regularization (SVR), which introduces an auxiliary support vector to control this perpendicular component, harnessing its rich information while mitigating the associated trajectory drift. The efficacy of SVR is critically governed by its semantic radius, for which we explore two unsupervised modeling strategies: direct parameterization, and an adaptive radius predictor module augmented with constraints that improve its prediction accuracy. Extensive experiments show that our method surpasses widely used baselines such as the InfoNCE and SigLIP losses on classification, monolingual retrieval, and multilingual retrieval over standard audio-text datasets. Both the theoretical analysis and the experiments on optimization trajectory drift validate the correctness and effectiveness of SVR. Notably, our method is highly efficient: it requires no extra training data or inference-time computation and adds only negligible overhead to training.
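The abstract does not give the SVR loss in closed form, so the following is only a minimal PyTorch sketch of the idea under stated assumptions, not the authors' formulation: it pairs a symmetric InfoNCE objective with a toy regularizer that, for each audio anchor, decomposes the mean negative text direction into components parallel and perpendicular to the matched positive and penalizes perpendicular mass beyond a learnable semantic radius (the "direct parameterization" strategy). All names (`InfoNCEWithSVRSketch`, `svr_weight`, `log_radius`) and the exact penalty form are illustrative assumptions.

```python
# Minimal sketch only: the exact SVR loss is not specified in the abstract,
# so this illustrates the idea rather than the authors' method.
import torch
import torch.nn.functional as F


class InfoNCEWithSVRSketch(torch.nn.Module):
    def __init__(self, temperature: float = 0.07, svr_weight: float = 0.1):
        super().__init__()
        self.temperature = temperature
        self.svr_weight = svr_weight  # hypothetical trade-off weight
        # "Direct parameterization" strategy: one learnable semantic radius,
        # stored in log space so the radius stays positive.
        self.log_radius = torch.nn.Parameter(torch.zeros(()))

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Project both modalities onto the unit hypersphere.
        a = F.normalize(audio_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = a @ t.T / self.temperature
        labels = torch.arange(a.size(0), device=a.device)
        # Standard symmetric InfoNCE (audio->text and text->audio).
        nce = 0.5 * (F.cross_entropy(logits, labels)
                     + F.cross_entropy(logits.T, labels))

        # Toy stand-in for the support-vector constraint: for anchor i, the
        # mean negative direction is (sum_j t_j - t_i) / (B - 1); split it
        # into parts parallel / perpendicular to the unit positive t_i and
        # penalize perpendicular mass exceeding the learned radius.
        mean_neg = (t.sum(dim=0, keepdim=True) - t) / (a.size(0) - 1)
        parallel = (mean_neg * t).sum(dim=-1, keepdim=True) * t
        perpendicular = mean_neg - parallel
        radius = self.log_radius.exp()
        svr = F.relu(perpendicular.norm(dim=-1) - radius).mean()
        return nce + self.svr_weight * svr
```

Keeping the radius in log space avoids explicit clamping to stay positive; the adaptive-predictor variant mentioned above would presumably replace this single parameter with a small module that predicts a per-sample radius, subject to the accuracy-improving constraints the abstract describes.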