Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech), with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. With this pipeline, we present LongVALE, the first Vision-Audio-Language Event understanding benchmark, comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions across 8.4K high-quality long videos. Furthermore, we build a baseline that leverages LongVALE to enable video large language models (LLMs) to perform fine-grained, omni-modal temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.