We propose a metadata-aware self-supervised learning (SSL) framework for fine-grained classification and ecological mapping of bird species around the world. Our framework unifies two SSL strategies, Contrastive Learning (CL) and Masked Image Modeling (MIM), and enriches the embedding space with the metadata that accompanies ground-level imagery of birds. We separately train uni-modal and cross-modal ViTs on a novel cross-view global bird species dataset containing ground-level imagery, metadata (location, time), and corresponding satellite imagery. By evaluating on two downstream tasks, fine-grained visual classification (FGVC) and cross-modal retrieval, we demonstrate that our models learn fine-grained and geographically conditioned features of birds. Models pre-trained with our framework achieve state-of-the-art (SotA) performance on FGVC of iNAT-2021 birds and in transfer learning settings on the CUB-200-2011 and NABirds datasets. Moreover, the strong cross-modal retrieval performance of our model enables the creation of species distribution maps across any geographic region. The dataset and source code will be released at https://github.com/mvrl/BirdSAT.