Surveillance videos are an essential component of daily life with various critical applications, particularly in public security. However, current surveillance video tasks mainly focus on classifying and localizing anomalous events. Existing methods achieve considerable performance, but they are limited to detecting and classifying predefined events, and their generalization ability and semantic understanding remain unsatisfactory. To address this issue, we construct the first multimodal surveillance video dataset by manually annotating the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), provides a novel benchmark for multimodal surveillance video analysis. It not only describes events in detail but also provides precise temporal grounding of each event at 0.1-second intervals. UCA contains 20,822 sentences with an average length of 23 words, and its annotated videos total 102 hours. Furthermore, we benchmark state-of-the-art models for multiple multimodal tasks on this dataset, including temporal sentence grounding in videos, video captioning, and dense video captioning. Our experiments show that mainstream models used on previously available public datasets perform poorly in multimodal surveillance video scenarios, which highlights the necessity of constructing this dataset. The dataset and code are available at: https://github.com/Xuange923/UCA-dataset.
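To make the notion of temporal grounding concrete, the following is a minimal sketch of how a sentence-level annotation with timestamps at 0.1-second resolution might be represented, together with temporal intersection-over-union (IoU), a measure commonly used to compare a predicted time span with an annotated one. The record layout, the field names, and the helper names `EventAnnotation`, `snap`, and `temporal_iou` are illustrative assumptions, not the actual UCA file format or evaluation code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EventAnnotation:
    """Hypothetical record: one sentence grounded to a time span in seconds."""
    video_id: str
    start: float
    end: float
    sentence: str


def snap(t: float) -> float:
    """Round a timestamp in seconds to one decimal place (0.1 s resolution)."""
    return round(t, 1)


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal IoU of two (start, end) spans; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Illustrative values only, not taken from the dataset.
example = EventAnnotation("video_0001", snap(12.34), snap(20.0), "A man enters the store.")
```

Under these assumptions, a predicted span of (0.0, 10.0) against an annotated span of (5.0, 15.0) overlaps for 5 seconds out of a 15-second union, giving an IoU of one third.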