In geographical image segmentation, performance is often constrained by the limited availability of training data and a lack of generalizability, particularly for segmenting mobility infrastructure such as roads, sidewalks, and crosswalks. Vision foundation models like the Segment Anything Model (SAM), pre-trained on millions of natural images, have demonstrated impressive zero-shot segmentation performance and thus offer a potential solution. However, SAM struggles with geographical images such as aerial and satellite imagery, both because its training was confined to natural images and because the narrow features and textures of mobility infrastructure blend into the surroundings. To address these challenges, we propose Geographical SAM (GeoSAM), a SAM-based framework that fine-tunes SAM with automatically generated multi-modal prompts: point prompts from a pre-trained task-specific model serve as primary visual guidance, while text prompts from a large language model serve as secondary semantic guidance to enhance model comprehension. GeoSAM outperforms existing approaches for mobility infrastructure segmentation in both familiar and completely unseen regions by at least 5\% in mIoU, representing a significant leap in leveraging foundation models to segment mobility infrastructure, including both road and pedestrian infrastructure, in geographical images. The source code can be found in this GitHub repository: https://github.com/rafiibnsultan/GeoSAM.
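To illustrate the kind of automatic point-prompt generation the abstract describes, the sketch below samples positive and negative click points from a task-specific model's per-pixel foreground probability map, using SAM's label convention (1 = foreground, 0 = background). This is a minimal hypothetical helper, not GeoSAM's actual sampling procedure; the function name, thresholds, and point counts are assumptions for illustration.

```python
import numpy as np

def sample_point_prompts(prob_map, n_pos=3, n_neg=3,
                         fg_thresh=0.8, bg_thresh=0.2, seed=0):
    """Sample point prompts from a pre-trained task-specific model's
    foreground probability map (hypothetical sketch; GeoSAM's exact
    sampling scheme may differ).

    Returns (points, labels): points are (N, 2) row/col coordinates,
    labels follow SAM's convention (1 = foreground, 0 = background).
    """
    rng = np.random.default_rng(seed)
    fg = np.argwhere(prob_map >= fg_thresh)  # confident infrastructure pixels
    bg = np.argwhere(prob_map <= bg_thresh)  # confident background pixels
    pos = fg[rng.choice(len(fg), size=min(n_pos, len(fg)), replace=False)]
    neg = bg[rng.choice(len(bg), size=min(n_neg, len(bg)), replace=False)]
    points = np.concatenate([pos, neg], axis=0)
    labels = np.array([1] * len(pos) + [0] * len(neg))
    return points, labels

# Toy probability map: a confident 3x3 "sidewalk" patch on background.
prob = np.zeros((8, 8))
prob[2:5, 2:5] = 0.95
points, labels = sample_point_prompts(prob, n_pos=2, n_neg=2)
```

Note that SAM's predictor expects point coordinates in (x, y) order, so the (row, col) indices above would be swapped before being passed as prompts alongside the text guidance.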