Text-based person search aims to retrieve specific individuals across camera networks using natural language descriptions. However, current benchmarks are often biased towards common actions such as walking or standing, neglecting the critical need to identify abnormal behaviors in real-world scenarios. To meet this demand, we propose a new task, text-based person anomaly search, which locates pedestrians engaged in either routine or anomalous activities via text. To enable training and evaluation for this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark featuring a broad spectrum of actions, e.g., running, performing, and playing soccer, together with the corresponding anomalies of the same identity, e.g., lying, being hit, and falling. The training set of PAB comprises 1,013,605 synthesized image-text pairs covering both normal and anomalous behaviors, while the test set contains 1,978 real-world image-text pairs. To validate the potential of PAB, we introduce a cross-modal pose-aware framework that integrates human pose patterns with identity-based hard negative pair sampling. Extensive experiments on the proposed benchmark show that synthetic training data facilitates fine-grained behavior retrieval, and that the proposed pose-aware method achieves 84.93% Recall@1 accuracy, surpassing other competitive methods. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/CMP.
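To make the idea of identity-based hard negative pair sampling concrete, below is a minimal, hedged sketch in Python. It assumes only what the abstract states: each identity appears with both routine and anomalous captions, and a hard negative for an anchor pair is a caption of the *same* identity but a *different* behavior, which is harder to reject than a random cross-identity caption. The names (`PabSample`, `sample_hard_negative`) are illustrative and not taken from the released CMP code.

```python
import random
from dataclasses import dataclass

@dataclass
class PabSample:
    identity: int    # pedestrian identity shared by normal/anomaly images
    caption: str     # text description of the depicted behavior
    image_path: str

def sample_hard_negative(anchor: PabSample, pool: list, rng=random) -> PabSample:
    """Prefer a caption of the SAME identity but a DIFFERENT behavior
    (e.g., 'running' vs. 'falling'); fall back to an ordinary
    cross-identity negative if none exists."""
    same_id = [s for s in pool
               if s.identity == anchor.identity and s.caption != anchor.caption]
    if same_id:
        return rng.choice(same_id)   # hard negative: same person, other action
    others = [s for s in pool if s.identity != anchor.identity]
    return rng.choice(others)        # fallback: different person

# Toy usage: identity 7 appears with a routine and an anomalous caption.
pool = [
    PabSample(7, "a man running along the street", "7_run.jpg"),
    PabSample(7, "a man falling on the pavement", "7_fall.jpg"),
    PabSample(3, "a woman playing soccer", "3_soccer.jpg"),
]
neg = sample_hard_negative(pool[0], pool)
print(neg.caption)  # -> "a man falling on the pavement"
```

Such negatives force the model to discriminate behaviors rather than identities alone, which is presumably why PAB pairs each action with an anomaly of the same person.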