Remote sensing applications increasingly rely on deep learning for scene classification, yet their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. Recent vision-language models such as CLIP learn transferable representations at scale by aligning visual and textual modalities, but their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods: Context Optimization (CoOp), Conditional Context Optimization (CoCoOp), Multi-modal Prompt Learning (MaPLe), and Prompting with Self-Regulating Constraints (PromptSRC). These approaches reflect complementary design philosophies, ranging from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios, with PromptSRC achieving the most robust cross-domain performance. Our findings establish prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
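To make the zero-shot baseline concrete, the sketch below shows how CLIP can classify a remote sensing scene from hand-crafted prompts; the prompt-learning methods studied here replace the fixed text template with learnable context vectors while keeping both encoders frozen. This is a minimal illustration assuming OpenAI's `clip` package; the class names, prompt template, and image path are hypothetical placeholders, not the benchmark configuration used in the paper.

```python
import torch
import clip  # assumes OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # frozen pretrained encoders

# Hand-crafted prompt template over illustrative remote sensing scene classes.
classes = ["airport", "forest", "harbor", "residential area"]
prompts = clip.tokenize([f"a satellite photo of a {c}." for c in classes]).to(device)

# Hypothetical query image.
image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between the image embedding and each class-prompt embedding.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("predicted class:", classes[probs.argmax(dim=-1).item()])
```

In a CoOp-style variant, the tokenized template would be replaced by a small set of learnable context embeddings prepended to each class name and optimized with cross-entropy on the few labeled shots, while the image and text encoders stay fixed.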