Multimodal large language models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their effectiveness on multimodal sentiment analysis remains constrained by the scarcity of high-quality training data, which limits both accurate multimodal understanding and generalization. To alleviate this bottleneck, we leverage diffusion models to perform semantics-preserving augmentation on the video and audio modalities, expanding the multimodal training distribution. Increasing data quantity alone, however, is insufficient: diffusion-generated samples vary substantially in quality, and noisy augmentations can degrade performance. We therefore propose DaQ-MSA (Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis), which introduces a quality scoring module that evaluates the reliability of each augmented sample and assigns it an adaptive training weight. By down-weighting low-quality samples and emphasizing high-fidelity ones, DaQ-MSA enables more stable learning. Combining the generative capability of diffusion models with the semantic understanding of MLLMs, our approach provides a robust and generalizable automated augmentation strategy for training MLLMs without any human annotation or additional supervision.
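As a rough illustration of the adaptive weighting idea described above, the sketch below shows a quality-weighted training loss in which each augmented sample's per-sample loss is scaled by a score in [0, 1]. The function name, the scorer, and the dummy tensors are illustrative assumptions, not the paper's actual scoring module or training objective.

```python
# Hypothetical sketch of quality-weighted training on diffusion-augmented samples.
# The quality scores here stand in for the output of a quality scoring module;
# DaQ-MSA's real scorer and loss are defined in the paper, not reproduced here.
import torch
import torch.nn.functional as F

def quality_weighted_loss(logits, labels, quality_scores):
    """Scale each sample's cross-entropy loss by its augmentation quality score.

    quality_scores: tensor of values in [0, 1]; low-quality augmentations
    contribute less to the gradient, high-fidelity ones contribute more.
    """
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    weights = quality_scores.clamp(0.0, 1.0)
    return (weights * per_sample).sum() / weights.sum().clamp_min(1e-8)

# Example usage with dummy tensors:
logits = torch.randn(4, 3)                   # batch of 4, 3 sentiment classes
labels = torch.tensor([0, 2, 1, 1])
scores = torch.tensor([0.9, 0.2, 1.0, 0.6])  # e.g. reliability of each augmented sample
loss = quality_weighted_loss(logits, labels, scores)
```

The normalization by the sum of weights keeps the loss scale comparable across batches regardless of how many low-quality samples were down-weighted.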