Vision-Language Models in Remote Sensing: Current Progress and Future Trends

The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in the field of large language models for Artificial General Intelligence (AGI). These models provide us with intelligent solutions that are more similar to human thinking, enabling us to use general artificial intelligence to solve problems in various applications. However, in the field of remote sensing, the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research primarily focuses on visual understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-language models excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond recognizing the objects in an image and can infer the relationships between them, as well as generate natural language descriptions of the image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning, text-based image retrieval, and visual question answering. This paper provides a comprehensive review of the research on vision-language models in remote sensing, summarizing the latest progress, highlighting the current challenges, and identifying potential research opportunities. Specifically, we review the application of vision-language models in several mainstream remote sensing tasks, including image captioning, text-based image generation, text-based image retrieval, visual question answering, scene classification, semantic segmentation, and object detection. For each task, we briefly describe the task background and review some representative works. Finally, we summarize the limitations of existing work and provide some possible directions for future development.

