Automatic event identification and recurrent-behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination that surveillance requires. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols, inter-pair and intra-pair, to assess cross-action discrimination and temporal direction understanding. Although these action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework that produces interpretable embeddings from MLLM-generated descriptions of both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions for constructing the benchmark are publicly available.
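To make the description-as-embedding idea concrete, the sketch below shows one way such a training-free pipeline could be wired up: an MLLM turns each image or video into a textual action description, and a frozen text encoder embeds that description for cosine-similarity retrieval. This is a minimal illustration under stated assumptions, not the paper's actual implementation; the encoder choice (`all-MiniLM-L6-v2`), the `describe_with_mllm` placeholder, and all function names are hypothetical.

```python
# Minimal sketch of a training-free description-to-embedding retrieval
# pipeline. All model names and helpers here are illustrative assumptions,
# not the authors' exact method.
import numpy as np
from sentence_transformers import SentenceTransformer


def describe_with_mllm(media_path: str) -> str:
    """Placeholder for an MLLM inference call that returns a textual
    description of the action in an image or video clip."""
    raise NotImplementedError("Plug in your MLLM inference here.")


# Frozen text encoder: no task-specific training is performed (assumed model).
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def embed(media_paths: list[str]) -> tuple[np.ndarray, list[str]]:
    # 1) The MLLM produces a human-readable description per item, which is
    #    what makes the resulting embedding interpretable.
    descriptions = [describe_with_mllm(p) for p in media_paths]
    # 2) The frozen encoder maps descriptions to L2-normalized vectors.
    vecs = encoder.encode(descriptions, normalize_embeddings=True)
    return np.asarray(vecs), descriptions


def retrieve(query_vec: np.ndarray, gallery_vecs: np.ndarray, top_k: int = 5):
    # Cosine similarity reduces to a dot product on normalized vectors;
    # return indices of the top-k most similar gallery items.
    sims = gallery_vecs @ query_vec
    return np.argsort(-sims)[:top_k]
```

Because retrieval operates on generated text rather than raw pixels, a failure case can be inspected directly by reading the description that produced the embedding, which is the interpretability property the abstract highlights.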