Audio-Visual Language Models (AVLMs) have made remarkable progress in recent years, but their reliability remains bottlenecked by cross-modal hallucination. A particularly pervasive form is video-driven audio hallucination: models routinely exploit visual shortcuts to report the sounds a scene would be expected to contain, disregarding the actual auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO), a dual-axis preference learning framework. It combines an output-contrastive objective, which penalizes visual descriptions masquerading as audio facts, with an input-contrastive objective, which swaps audio tracks to explicitly penalize generations that are invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising general multimodal capabilities.
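The abstract names the two contrastive axes but does not state the loss itself. The following is a minimal sketch of how such a dual-axis objective could be written, assuming a DPO-style logistic loss on log-probability margins against a frozen reference model. The function names, the dictionary keys, and the hyperparameters `beta` and `lam` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_term(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta):
    """DPO-style term (an assumption): reward the policy for widening the
    chosen-vs-rejected log-prob margin relative to the reference model."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def dual_axis_loss(policy, reference, beta=0.1, lam=1.0):
    """Sum of two preference terms over sequence log-probabilities.

    Hypothetical keys in `policy` and `reference`:
      faithful_true    : audio-faithful response, given video + true audio
      visual_true      : visually driven response, given video + true audio
      faithful_swapped : audio-faithful response, given video + swapped audio
    """
    # Output axis: same input, prefer the audio-faithful response over the
    # response that describes visuals as if they were heard.
    output_term = preference_term(
        policy["faithful_true"], policy["visual_true"],
        reference["faithful_true"], reference["visual_true"], beta,
    )
    # Input axis: same response, prefer it under the true audio over the
    # swapped audio, so the response cannot be invariant to the audio track.
    input_term = preference_term(
        policy["faithful_true"], policy["faithful_swapped"],
        reference["faithful_true"], reference["faithful_swapped"], beta,
    )
    return output_term + lam * input_term


if __name__ == "__main__":
    ref = {"faithful_true": -10.0, "visual_true": -10.0, "faithful_swapped": -10.0}
    pol = {"faithful_true": -5.0, "visual_true": -10.0, "faithful_swapped": -10.0}
    print(dual_axis_loss(ref, ref), dual_axis_loss(pol, ref))
```

Under these assumptions, a policy identical to the reference has zero margin on both axes, and the loss falls as the policy assigns more probability to the audio-faithful response specifically when the true audio is present.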