Remote sensing (RS) images contain numerous objects at different scales, which makes RS image change captioning (RSICC) challenging: the task must identify visual changes of interest in complex scenes and describe them in natural language. However, current methods still fall short in extracting and utilizing multi-scale information. In this paper, we propose a progressive scale-aware network (PSNet) to address this problem. PSNet is a purely Transformer-based model. To sufficiently extract multi-scale visual features, we stack multiple progressive difference perception (PDP) layers that progressively exploit the difference features between bitemporal features. To sufficiently utilize the extracted multi-scale features for captioning, we propose a scale-aware reinforcement (SR) module and combine it with the Transformer decoding layer to progressively utilize the features from different PDP layers. Experiments show that the PDP layer and SR module are effective and that PSNet outperforms previous methods.
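The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the PSNet implementation: the abstract does not specify the internals of the PDP layer or the SR module, so `pdp_layer` and `decode_progressively` are hypothetical stand-ins that use plain, weight-free attention. The sketch only shows the assumed overall pattern: stacked layers each emit a difference feature, and successive decoding steps consume those features one layer at a time.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention without learned weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def pdp_layer(f1, f2):
    """Hypothetical stand-in for one PDP layer (assumption, not the paper's design).

    Takes the difference of the bitemporal features, then lets each
    temporal feature attend to that difference with a residual update.
    """
    diff = f2 - f1
    f1_new = f1 + attention(f1, diff, diff)
    f2_new = f2 + attention(f2, diff, diff)
    return f1_new, f2_new, diff


def extract_multiscale_diffs(f1, f2, num_layers):
    """Stack PDP stand-ins and collect one difference feature per layer."""
    diffs = []
    for _ in range(num_layers):
        f1, f2, diff = pdp_layer(f1, f2)
        diffs.append(diff)
    return diffs


def decode_progressively(words, diffs):
    """Hypothetical stand-in for SR + decoding: decoding step i
    cross-attends to the difference feature from PDP stand-in i."""
    h = words
    for diff in diffs:
        h = h + attention(h, diff, diff)
    return h


# Toy inputs: 49 visual tokens and 5 word tokens, 16 channels each (arbitrary sizes).
rng = np.random.default_rng(0)
f1 = rng.normal(size=(49, 16))     # features of the "before" image
f2 = rng.normal(size=(49, 16))     # features of the "after" image
words = rng.normal(size=(5, 16))   # embedded caption tokens

diffs = extract_multiscale_diffs(f1, f2, num_layers=3)
out = decode_progressively(words, diffs)
print(len(diffs), out.shape)
```

A real model would add learned projections, multi-head attention, normalization, and feed-forward sublayers; these are omitted here to keep the progressive layer-by-layer pairing visible.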