Automatic image captioning is a promising technique for conveying visual information in natural language. It can benefit various tasks in satellite remote sensing, such as environmental monitoring, resource management, and disaster management. However, one of the main challenges in this domain is the lack of large-scale image-caption datasets, which require considerable human expertise and effort to create. Recent research on large language models (LLMs) has demonstrated their impressive performance in natural language understanding and generation tasks. Nonetheless, most of them (e.g., GPT-3.5, Falcon, Claude) cannot handle images, while conventional captioning models pre-trained on general ground-view images (e.g., BLIP, GIT, CM3, CM3Leon) often fail to produce detailed and accurate captions for aerial images. To address this problem, we propose a novel approach, Automatic Remote Sensing Image Captioning (ARSIC), which automatically collects captions for remote sensing images by guiding LLMs to describe the images' object annotations. We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote sensing images. Our evaluation demonstrates the effectiveness of our approach for collecting captions for remote sensing images.
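As a rough illustration of what "guiding LLMs to describe object annotations" could look like, the sketch below turns a list of bounding-box annotations into a text prompt that a text-only LLM could caption. The abstract does not specify the prompt format or the annotation schema, so everything here is an assumption made for illustration: the function names `region_of` and `build_prompt`, the `label`/`bbox` dictionary keys, the 3x3 position grid, and the prompt wording are all hypothetical. No LLM is called.

```python
from collections import Counter


def region_of(bbox, width, height):
    """Map a box centre to a coarse 3x3 grid name such as 'top left'.

    bbox is (x_min, y_min, x_max, y_max) in pixels (assumed schema).
    """
    x_min, y_min, x_max, y_max = bbox
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    # Clamp to index 2 so a centre exactly on the far edge stays in range.
    col = ("left", "center", "right")[min(int(3 * cx / width), 2)]
    row = ("top", "middle", "bottom")[min(int(3 * cy / height), 2)]
    return f"{row} {col}"


def build_prompt(annotations, width, height):
    """Serialize object annotations into a captioning prompt (hypothetical format)."""
    lines = [
        f"- {ann['label']} in the {region_of(ann['bbox'], width, height)}"
        for ann in annotations
    ]
    counts = Counter(ann["label"] for ann in annotations)
    summary = ", ".join(f"{n} {label}" for label, n in sorted(counts.items()))
    return (
        "The following objects were annotated in an aerial image "
        f"({width}x{height} pixels).\n"
        f"Object counts: {summary}.\n"
        "Objects:\n" + "\n".join(lines) + "\n"
        "Write one factual caption describing the scene. "
        "Mention only the listed objects."
    )


if __name__ == "__main__":
    example = [
        {"label": "plane", "bbox": (10, 10, 50, 50)},
        {"label": "plane", "bbox": (250, 250, 290, 290)},
        {"label": "ship", "bbox": (130, 130, 170, 170)},
    ]
    print(build_prompt(example, 300, 300))
```

The resulting string would be sent to whichever LLM is used, and the returned text paired with the image as a candidate caption. Coarse position words are used instead of raw coordinates only because they read more naturally in a prompt; this is a design choice of the sketch, not something stated in the source.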