Remote sensing images are highly valued for their ability to support complex real-world tasks such as risk management, security, and meteorology. However, manually captioning these images is challenging and requires specialized knowledge across multiple domains. This letter presents an approach for automatically describing (captioning) remote sensing images. We propose a novel encoder-decoder setup that combines a Text Graph Convolutional Network (TextGCN) with multi-layer LSTMs. The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels. Furthermore, we introduce a comparison-based beam search method to ensure fairness in the search strategy used to generate the final caption. We extensively evaluate our approach against other state-of-the-art encoder-decoder frameworks on three datasets using seven metrics: BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr. The results demonstrate that our approach significantly outperforms these methods.
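The abstract builds on beam search for caption decoding. As background, a minimal sketch of *standard* beam search over a next-token scorer is shown below; the comparison-based variant the abstract proposes is not described here, and `step_fn`, the toy vocabulary, and all parameter names are illustrative assumptions, not the paper's implementation.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Standard beam search (background sketch, not the paper's variant).

    step_fn(sequence) -> list of (token, probability) candidates for the
    next token. At each step we expand every live hypothesis, then keep
    only the beam_width highest-scoring ones, ranked by summed log-prob.
    """
    beams = [([start_token], 0.0)]  # (partial sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # hypothesis already finished
                completed.append((seq, score))
                continue
            for tok, prob in step_fn(seq):    # expand with each next token
                candidates.append((seq + [tok], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]       # prune to the top-k hypotheses
    completed.extend(b for b in beams if b[0][-1] == end_token)
    if not completed:                         # nothing finished within max_len
        completed = beams
    return max(completed, key=lambda c: c[1])[0]

# Toy deterministic "language model" for demonstration only.
def toy_step(seq):
    table = {
        "<s>": [("a", 0.6), ("b", 0.4)],
        "a":   [("</s>", 1.0)],
        "b":   [("c", 1.0)],
        "c":   [("</s>", 1.0)],
    }
    return table[seq[-1]]

caption = beam_search(toy_step, "<s>", "</s>", beam_width=2, max_len=5)
```

In a captioning decoder, `step_fn` would wrap the LSTM's softmax over the vocabulary; the fairness concern the abstract raises stems from comparing hypotheses of different lengths by raw cumulative log-probability, which this plain version does.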