Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, these results are partly achieved by exploiting biases in the datasets rather than by developing multimodal reasoning, which limits generalization. In this paper, we propose Compositional Counterfactual Contrastive Learning ($C^3$), a novel approach that applies contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and the tokens in dialogues, and we propose contrastive loss functions that exploit object-level or action-level variance. Unlike prior approaches, we contrast hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieve promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and show the benefits of our approach in grounding video and dialogue context.
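To make the idea of contrasting factual and counterfactual hidden states concrete, the following is a minimal sketch, not the loss used in $C^3$. It assumes an InfoNCE-style objective with cosine similarity and a temperature, applied per output token and averaged over the sequence. The function names (`cosine`, `contrastive_loss`, `sequence_loss`), the `temperature` value, and the toy vectors are all hypothetical and chosen only for illustration; the abstract does not specify the exact form of the loss or how object-level and action-level variance enter it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def contrastive_loss(anchor, factual, counterfactual, temperature=0.1):
    """InfoNCE-style loss for one output token (an assumed form, not the paper's).

    The anchor hidden state is pulled toward the factual sample's hidden state
    and pushed away from the counterfactual sample's hidden state.
    """
    pos = cosine(anchor, factual) / temperature
    neg = cosine(anchor, counterfactual) / temperature
    # Numerically stable log(exp(pos) + exp(neg)).
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos  # equals -log softmax probability of the positive


def sequence_loss(anchor_states, factual_states, counterfactual_states,
                  temperature=0.1):
    """Average the per-token loss over the output tokens of one response."""
    losses = [
        contrastive_loss(a, f, c, temperature)
        for a, f, c in zip(anchor_states, factual_states, counterfactual_states)
    ]
    return sum(losses) / len(losses)


# Toy hidden states for a two-token response (values are made up).
anchor_states = [[1.0, 0.0], [0.0, 1.0]]
factual_states = [[0.9, 0.1], [0.1, 0.9]]
counterfactual_states = [[0.0, 1.0], [1.0, 0.0]]

aligned = sequence_loss(anchor_states, factual_states, counterfactual_states)
swapped = sequence_loss(anchor_states, counterfactual_states, factual_states)
```

In this toy setting `aligned` is smaller than `swapped`, since the anchor states lie close to the factual states. Under the sketch's assumptions, minimizing such a loss would encourage token-level representations to stay close to those of factual samples while separating them from counterfactual ones.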