Text-Based Person Search (TBPS) aims to retrieve pedestrian images from large galleries using natural language descriptions. This task, essential for public safety applications, is hindered by cross-modal discrepancies and ambiguous user queries. We introduce CONQUER, a two-stage framework that addresses these challenges by enhancing cross-modal alignment during training and adaptively refining queries at inference. During training, CONQUER employs multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport to learn robust embeddings. At inference, a plug-and-play query enhancement module refines vague or incomplete queries via anchor selection and attribute-driven enrichment, without requiring retraining of the backbone. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that CONQUER consistently outperforms strong baselines in both Rank-1 accuracy and mAP, with notable gains in cross-domain and incomplete-query scenarios. These results establish CONQUER as a practical and effective solution for real-world TBPS deployment. Source code is available at https://github.com/zqxie77/CONQUER.