We live in a world filled with never-ending streams of multimodal information. Because they record real-world scenes more naturally, long-form audio-visual videos are expected to serve as an important bridge for better exploring and understanding the world. In this paper, we propose the task of multisensory temporal event localization in long-form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is carefully annotated with diverse modality-aware events over a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events and understanding their relations in long-form videos. It comprises three phases at different levels: a snippet prediction phase to learn snippet features, an event extraction phase to extract event-level features, and an event interaction phase to study event relations. Experiments on the new LFAV dataset demonstrate that the proposed method is effective at localizing multiple modality-aware events within long-form videos. Project website: http://gewu-lab.github.io/LFAV/
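To make the snippet, event, and relation levels concrete, the following is a toy sketch only, not the proposed framework: it uses no learned features and replaces each phase with simple post-processing. The per-snippet scores, the labels `audio:speech` and `visual:guitar`, the 0.5 threshold, the one-second snippet length, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (event label, start second, end second); the end is exclusive.
Segment = Tuple[str, int, int]


def snippet_predictions(scores: Dict[str, List[float]],
                        threshold: float = 0.5) -> Dict[str, List[bool]]:
    """Snippet level: mark, for each label, which snippets are active."""
    return {label: [s >= threshold for s in row] for label, row in scores.items()}


def extract_events(active: Dict[str, List[bool]],
                   snippet_len: int = 1) -> List[Segment]:
    """Event level: merge runs of consecutive active snippets into segments."""
    events: List[Segment] = []
    for label, flags in active.items():
        start = None
        # A trailing False sentinel closes a run that reaches the last snippet.
        for i, flag in enumerate(flags + [False]):
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                events.append((label, start * snippet_len, i * snippet_len))
                start = None
    return events


def event_overlaps(events: List[Segment]) -> List[Tuple[str, str, int]]:
    """Relation level: seconds of temporal overlap for each pair of events."""
    pairs = []
    for a in range(len(events)):
        for b in range(a + 1, len(events)):
            label_a, start_a, end_a = events[a]
            label_b, start_b, end_b = events[b]
            overlap = min(end_a, end_b) - max(start_a, start_b)
            if overlap > 0:
                pairs.append((label_a, label_b, overlap))
    return pairs


# Hypothetical per-second scores for a six-second clip (made-up numbers).
scores = {
    "audio:speech": [0.9, 0.8, 0.2, 0.1, 0.7, 0.6],
    "visual:guitar": [0.1, 0.7, 0.9, 0.8, 0.2, 0.1],
}
events = extract_events(snippet_predictions(scores))
relations = event_overlaps(events)
```

Here `events` holds one `(label, start, end)` tuple per localized event, and `relations` lists the pairs of events that overlap in time. Prefixing each label with its modality mirrors the idea of modality-aware events, where an audio event and a visual event can be localized independently in the same video.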