This paper addresses the challenge of answering questions in scenarios composed of rich, complex, and dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, their responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) Beyond straightforwardly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required by large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct to further enhance CAT's capacity to model cross-semantic correlations. 3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy that retrains the model to favor non-ambiguous responses and improves its ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA) tasks. The code and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.
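The abstract does not state the exact training objective, so as a point of reference only, the following is a minimal sketch of the standard direct preference optimization (DPO) loss for a single preference pair, which an ambiguity-aware variant would presumably build on. The function name, argument names, and the default `beta` are hypothetical; mapping the "chosen" response to a non-ambiguous answer and the "rejected" response to an ambiguous one is an assumption, not a description of the released CAT code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    Assumption: "chosen" is the non-ambiguous response and "rejected"
    is the ambiguous one.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins (the implicit reward gap).
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy and the reference agree on both responses, the loss equals log 2; it falls as the policy shifts probability toward the chosen response relative to the reference, and rises when it shifts toward the rejected one.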