Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its currently dominant tasks focus on detecting anomalies online at the frame level, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which maps complicated anomalous events to single labels, e.g., ``vandalism'', is superficial, since a single label is insufficient to characterize an anomalous event. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but little research has focused on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos across modalities, e.g., language descriptions and synchronous audio. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed and short, VAR is devised to retrieve long untrimmed videos that may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between fine-grained video and text representations. In addition, we leverage two complementary alignments to further match cross-modal content. Experimental results on the two benchmarks reveal the challenges of the VAR task and demonstrate the advantages of our tailored method.