Video anomaly detection (VAD) has advanced significantly through the integration of large language models (LLMs) and vision-language models (VLMs), which address critical challenges such as interpretability, temporal reasoning, and generalization in dynamic, open-world scenarios. This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods from 2024, focusing on four key aspects: (i) enhancing interpretability through semantic insights and textual explanations, making visual anomalies more understandable; (ii) capturing intricate temporal relationships to detect and localize dynamic anomalies across video frames; (iii) enabling few-shot and zero-shot detection to minimize reliance on large annotated datasets; and (iv) addressing open-world and class-agnostic anomalies by combining semantic understanding with motion features for spatiotemporal coherence. We highlight the potential of these methods to redefine the landscape of VAD. Additionally, we examine the synergy between the visual and textual modalities offered by LLMs and VLMs, underscoring their combined strengths and proposing future directions that fully exploit this potential for video anomaly detection.