Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation - Technical Report for IROS 2025 RoboSense Challenge Track 4

Cross-modal drone navigation remains a challenging task in robotics, requiring efficient retrieval of relevant images from large-scale databases based on natural language descriptions. Track 4 of the RoboSense 2025 Challenge targets this problem, focusing on robust, natural language-guided cross-view image retrieval across multiple platforms (drones, satellites, and ground cameras). Current baseline methods, while effective for initial retrieval, often struggle to achieve fine-grained semantic matching between text queries and visual content, especially in complex aerial scenes. To close this gap, we propose a two-stage retrieval refinement method, the Caption-Guided Retrieval System (CGRS), which enhances the baseline coarse ranking through reranking. Our method first uses a baseline model to obtain an initial coarse ranking of the 20 most relevant images for each query. We then use a vision-language model (VLM) to generate detailed captions for these candidate images, capturing rich semantic descriptions of their visual content. The generated captions are then used in a multimodal similarity computation framework to rerank the candidates against the original text query at a fine-grained level, effectively building a semantic bridge between visual content and natural language descriptions. Our approach improves on the baseline by a consistent 5% across all key metrics (Recall@1, Recall@5, and Recall@10) and took second place (TOP-2) in the challenge, demonstrating the practical value of our semantic refinement strategy in real-world robotic navigation scenarios.
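The two-stage flow described above (coarse top-20 retrieval, captioning, caption-based reranking) can be illustrated with a minimal sketch. The abstract does not specify the similarity function or how caption similarity is combined with the baseline score, so everything here is an assumption for illustration: `bow_cosine` is a bag-of-words stand-in for a real text encoder, the `captions` dictionary stands in for VLM output, and the fusion weight `alpha` and the function name `rerank` are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (stand-in for a text encoder)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query, coarse, captions, top_k=20, alpha=0.5):
    """Rerank the top_k coarse candidates using caption-to-query similarity.

    coarse:   list of (image_id, baseline_score), best first (stage 1 output).
    captions: image_id -> caption text (assumed to come from a VLM).
    alpha:    hypothetical weight on the baseline score in the fused score.
    """
    head, tail = coarse[:top_k], coarse[top_k:]
    fused = [
        (alpha * score + (1 - alpha) * bow_cosine(query, captions[img]), img)
        for img, score in head
    ]
    # Sort only the head by fused score; candidates beyond top_k keep their order.
    fused.sort(key=lambda pair: pair[0], reverse=True)
    return [img for _, img in fused] + [img for img, _ in tail]


# Toy data, invented purely for illustration.
query = "red roof building next to a river"
coarse = [("img_a", 0.62), ("img_b", 0.60), ("img_c", 0.55)]
captions = {
    "img_a": "a parking lot with many cars",
    "img_b": "a red roof building next to a river",
    "img_c": "a green field with trees",
}
reranked = rerank(query, coarse, captions)
print(reranked)
```

In this toy case the caption of `img_b` matches the query closely, so it moves ahead of `img_a` even though the baseline ranked it second. Restricting reranking to the top 20 keeps the cost of caption generation bounded, at the price of never recovering a correct image that the baseline ranked below that cutoff.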
