Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
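The two-stage cascade described above can be illustrated with a minimal sketch. It is not the SSDC implementation: the abstract does not specify the similarity measure or the fusion rule, so cosine similarity and a weighted sum with a hypothetical weight `alpha` are assumed here, and the whole Detective/Analyst/Writer squad is abstracted into a single hypothetical callable `semantic_score`. The point is only that the expensive semantic step runs on the shortlist, not on the full gallery.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cascade_search(
    query_vec: Sequence[float],
    gallery_vecs: Sequence[Sequence[float]],
    semantic_score: Callable[[int], float],  # hypothetical stand-in for the agent squad
    top_k: int = 3,
    alpha: float = 0.5,  # assumed fusion weight; not given in the abstract
) -> List[int]:
    """Return gallery indices of the shortlist, re-ranked by a fused score."""
    # Stage 1: cheap structural scoring over the whole gallery.
    coarse = [(i, cosine(query_vec, v)) for i, v in enumerate(gallery_vecs)]
    coarse.sort(key=lambda pair: pair[1], reverse=True)
    shortlist = coarse[:top_k]

    # Stage 2: costly semantic scoring on the shortlist only, then fusion
    # of the structural prior with the semantic score (assumed weighted sum).
    fused = [
        (i, alpha * structural + (1.0 - alpha) * semantic_score(i))
        for i, structural in shortlist
    ]
    fused.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in fused]
```

With a gallery of N items, `semantic_score` is called at most `top_k` times, which is where the efficiency of a cascade of this kind would come from.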