GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids

Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M is set according to the number of proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11) and outperforming other zero-shot approaches on object-level RBDC by more than 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show that GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
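The filtering idea behind SCC can be illustrated with a minimal sketch. The abstract states only that proposals recurring across multiple independent samplings are retained; it does not specify how proposals are matched or what threshold is used. The sketch below therefore assumes exact string matching after simple normalization and a hypothetical `min_votes` threshold, and the function names `normalize` and `consolidate` are illustrative rather than taken from the GridVAD code.

```python
from collections import Counter


def normalize(proposal: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match.

    Assumption: exact matching after normalization stands in for whatever
    proposal matching the actual method uses.
    """
    return " ".join(proposal.lower().split())


def consolidate(samples: list[list[str]], min_votes: int) -> list[str]:
    """Keep proposals that appear in at least `min_votes` of the samplings.

    `samples` holds one list of proposal strings per independent VLM sampling.
    `min_votes` is a hypothetical threshold, not a value from the paper.
    """
    votes = Counter()
    for sample in samples:
        # A set ensures each proposal is counted at most once per sampling.
        votes.update({normalize(p) for p in sample})
    return sorted(p for p, n in votes.items() if n >= min_votes)


# Three made-up samplings for a single clip.
samples = [
    ["person riding a bicycle", "small cart on walkway"],
    ["Person riding a bicycle", "dog running"],
    ["person riding a  bicycle", "small cart on walkway"],
]
kept = consolidate(samples, min_votes=2)
print(kept)
```

In this toy setting, raising `min_votes` discards more one-off proposals at the risk of dropping real anomalies, which mirrors the precision-recall tradeoff the ablations attribute to SCC.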
