Capitalizing on vast amounts of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been utilized in several applications. However, models trained on general, everyday web-crawled data often exhibit sub-optimal performance in specialized domains, likely due to domain shift. Recent works have tackled this problem for some domains (e.g., healthcare) by constructing domain-specialized image-text data. However, constructing a dedicated large-scale image-text dataset for the sustainable domain of agriculture and livestock remains open to research. Furthermore, this domain demands fine-grained feature learning due to the subtle nature of its downstream tasks (e.g., nutrient deficiency detection, livestock breed classification). To address this, we present AgriCLIP, a vision-language foundation model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset, named ALive, that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fishery, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates both contrastive and self-supervised learning to capture both global semantic and local fine-grained, domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework, which achieves an absolute gain of 7.8\% in average zero-shot classification accuracy over standard CLIP adaptation on the domain-specialized ALive dataset. Our ALive dataset and code are publicly available at \href{https://github.com/umair1221/AgriCLIP/tree/main}{GitHub}.