In this paper, we propose a solution for cross-modal transportation retrieval. Because traffic images span multiple domains, we split the problem into two sub-tasks, pedestrian retrieval and vehicle retrieval, using a simple strategy. For pedestrian retrieval, we adopt IRRA as the base model and design an attribute classification module to mine the knowledge implicit in attribute labels. More importantly, we apply an Inclusion Relation Matching strategy so that image-text pairs with an inclusion relation obtain similar representations in the feature space. For vehicle retrieval, we adopt BLIP as the base model. Because aligning vehicle color attributes is challenging, we introduce attribute-based object detection to add color patches to vehicle images as color data augmentation. These patches serve as strong prior information that helps the model perform image-text alignment. We also incorporate the labeled attributes into the image-text alignment loss to learn fine-grained alignment and to prevent similar images and texts from being incorrectly pushed apart. Our approach ranked first on the final B-board test with a score of 70.9.
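The color data augmentation above can be illustrated with a minimal sketch: a solid patch of the detected vehicle color is pasted onto the image crop as a visual prior. The function name, patch size, and placement here are assumptions for illustration, not the authors' implementation, and the attribute detector that supplies the color is not shown.

```python
import numpy as np

def add_color_patch(image, color, patch_size=32, corner=(0, 0)):
    """Paste a solid color patch onto an (H, W, 3) uint8 image.

    `color` is an RGB triple, assumed to come from an attribute-based
    object detector (hypothetical interface, not shown here).
    """
    augmented = image.copy()
    y, x = corner
    augmented[y:y + patch_size, x:x + patch_size] = color
    return augmented

# Example: a uniform gray vehicle crop gets a red patch in the top-left corner.
vehicle = np.full((128, 128, 3), 128, dtype=np.uint8)
patched = add_color_patch(vehicle, color=(255, 0, 0))
```

The patch gives the vision encoder an unambiguous color cue that the text encoder's color words can align to, which is the "strong prior information" role described above.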