In everyday communication, humans frequently use speech and gestures to refer to specific areas or objects, a process known as Referential Dialogue (RD). While prior studies have investigated RD with Large Language Models (LLMs) or Large Multimodal Models (LMMs) in static contexts, the exploration of Temporal Referential Dialogue (TRD) in audio-visual media remains limited. Two primary challenges hinder progress in this field: (1) the absence of comprehensive, untrimmed audio-visual video datasets with precise temporal annotations, and (2) the need for methods that effectively integrate complex temporal auditory and visual cues. To address these challenges, we introduce a novel framework for generating PU-VALOR, an extensive audio-visual dataset comprising over 114,000 untrimmed videos with accurate temporal demarcations. We also present AVicuna, which features an Audio-Visual Tokens Interleaver (AVTI) that ensures the temporal alignment of audio-visual information. Additionally, we develop the A5-222K dataset, encompassing more than 200,000 audio-text pairs, to facilitate audio-text alignment. Our experiments demonstrate that AVicuna effectively handles TRD in audio-visual videos and achieves state-of-the-art performance on various audio-visual video understanding tasks, particularly on untrimmed videos. We further investigate the optimal audio-interleaving rate for interleaved audio-visual inputs, i.e., the rate that maximizes performance on the Audio-Visual Event Dense Localization task.
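The abstract states that the AVTI interleaves audio and visual tokens at a tunable audio-interleaving rate so the two modalities stay temporally aligned. Below is a minimal, hypothetical sketch of what such interleaving could look like; the function name `interleave_av_tokens`, the token representation, and the exact semantics of `audio_rate` are our assumptions for illustration, not AVicuna's published implementation.

```python
# A minimal, hypothetical sketch of audio-visual token interleaving.
# Names and the meaning of `audio_rate` are illustrative assumptions,
# not AVicuna's actual AVTI implementation.
from typing import List

def interleave_av_tokens(video_tokens: List[str],
                         audio_tokens: List[str],
                         audio_rate: float = 0.2) -> List[str]:
    """Merge two temporally ordered token streams so that audio tokens
    occupy roughly `audio_rate` of the interleaved sequence, each placed
    near its temporal position among the video tokens."""
    if not audio_tokens or audio_rate <= 0.0:
        return list(video_tokens)
    n_v = len(video_tokens)
    # Downsample audio so its share of the final sequence matches the rate:
    # n_a / (n_v + n_a) == audio_rate  =>  n_a = audio_rate * n_v / (1 - audio_rate)
    n_a = max(1, round(audio_rate * n_v / (1.0 - audio_rate)))
    step = len(audio_tokens) / n_a
    kept = [audio_tokens[min(int(k * step), len(audio_tokens) - 1)]
            for k in range(n_a)]
    # Merge by normalized timestamp so both modalities stay time-aligned,
    # regardless of their differing sampling rates.
    out: List[str] = []
    i = j = 0
    while i < n_v or j < n_a:
        t_v = i / n_v if i < n_v else float("inf")
        t_a = j / n_a if j < n_a else float("inf")
        if t_v <= t_a:
            out.append(video_tokens[i]); i += 1
        else:
            out.append(kept[j]); j += 1
    return out

# Example: 6 video tokens, 3 audio tokens, rate 0.25 keeps 2 audio tokens
# slotted near their original temporal positions:
# ['v0', 'a0', 'v1', 'v2', 'v3', 'a1', 'v4', 'v5']
print(interleave_av_tokens([f"v{k}" for k in range(6)],
                           [f"a{k}" for k in range(3)], audio_rate=0.25))
```

Merging by normalized timestamp rather than by fixed block sizes keeps each audio token adjacent to the video tokens covering the same time span, which is one plausible way to realize the temporal alignment the AVTI is described as providing.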