Capitalizing on vast amounts of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been utilized in several applications. However, models trained on general, everyday web-crawled data often exhibit sub-optimal performance in specialized domains, likely due to domain shift. Recent works have tackled this problem for some domains (e.g., healthcare) by constructing domain-specialized image-text data. However, constructing a dedicated large-scale image-text dataset for the sustainable domain of agriculture and livestock remains open to research. Furthermore, this domain demands fine-grained feature learning due to the subtle nature of its downstream tasks (e.g., nutrient deficiency detection, livestock breed classification). To address this, we present AgriCLIP, a vision-language foundation model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset, named ALive, that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fishery, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates both contrastive and self-supervised learning to capture both global semantic and local fine-grained, domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework, which achieves an absolute gain of 7.8\% in average zero-shot classification accuracy over standard CLIP adaptation on the domain-specialized ALive dataset. Our ALive dataset and code are publicly available at \href{https://github.com/umair1221/AgriCLIP/tree/main}{GitHub}.