The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in large language models as a path toward Artificial General Intelligence (AGI). These models offer intelligent solutions that more closely resemble human thinking, enabling us to apply AGI to problems across a variety of applications. In remote sensing, however, the scientific literature on implementing AGI remains relatively scant. Existing AI-related research primarily focuses on visual understanding tasks while neglecting the semantic understanding of objects and their relationships. This is where vision-language models excel: they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models go beyond recognizing the objects in an image; they can infer the relationships between those objects and generate natural language descriptions of the image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning, text-based image retrieval, and visual question answering. This paper provides a comprehensive review of research on vision-language models in remote sensing, summarizing the latest progress, highlighting current challenges, and identifying potential research opportunities. Specifically, we review the application of vision-language models to several mainstream remote sensing tasks, including image captioning, text-based image generation, text-based image retrieval, visual question answering, scene classification, semantic segmentation, and object detection. For each task, we briefly describe the task background and review representative works. Finally, we summarize the limitations of existing work and suggest possible directions for future development.