Vision-language foundation models based on the CLIP architecture for remote sensing rely primarily on short text captions, which often yield incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset in which each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Building on this dataset, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks (long text cross-modal retrieval, short text cross-modal retrieval, image classification, and semantic localization) demonstrate that DGTRS-CLIP consistently outperforms existing methods across all tasks. The code has been open-sourced and is available at https://github.com/MitsuiChen14/DGTRS.