Urbanization challenges underscore the need for effective satellite image-text retrieval methods that can swiftly access specific information enriched with geographic semantics for urban applications. However, existing methods often overlook the significant domain gaps across diverse urban landscapes and focus primarily on improving retrieval performance within a single domain. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs a Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving fine-grained alignment of images, segments, and texts and yielding a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, which progressively enhance adaptability across domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and adaptation to new urban environments: it achieves an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap.