Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by matching paired image and caption embeddings while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller than the general image-text datasets available from the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework fine-tuned from the pre-trained CLIP model using SpaceNet images and BLIP-based captions. SpaceNet, obtained via FLARE, comprises ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from SpaceNet images and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a straightforward yet powerful framework, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.
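The contrastive objective described above (matching paired embeddings while pushing unrelated pairs apart) can be illustrated with a minimal sketch of the symmetric InfoNCE loss used by CLIP-style models. This is an illustrative NumPy implementation, not the authors' code; the temperature value and function name are assumptions for the example.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Illustrative sketch only; temperature=0.07 is an assumed default.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) pairwise similarity matrix
    labels = np.arange(len(logits))     # matched pairs lie on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each image embedding toward its paired caption embedding (the diagonal of the similarity matrix) and away from every other caption in the batch, which is what yields the zero-shot transfer behavior the abstract refers to.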