Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, Baskar Ganapathysubramanian

From arXiv; preprint under review

We introduce Arboretum, the largest publicly accessible dataset designed to advance AI for biodiversity applications. The dataset, curated from the iNaturalist community science platform and vetted by domain experts to ensure accuracy, contains 134.6 million images, surpassing existing datasets in scale by an order of magnitude. It provides image-language paired data for a diverse set of species spanning birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungi/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource for multimodal vision-language AI models for biodiversity assessment and agriculture research. Each image is annotated with its scientific name, taxonomic details, and common name, which enhances the robustness of AI model training. We showcase the value of Arboretum by releasing a suite of CLIP models trained on a subset of 40 million captioned images. We also introduce several new benchmarks for rigorous assessment and report zero-shot accuracy, along with evaluations across life stages, rare species, confounding species, and various levels of the taxonomic hierarchy. We anticipate that Arboretum will spur the development of AI models that enable a variety of digital tools, ranging from pest control strategies and crop monitoring to worldwide biodiversity assessment and environmental conservation. These advancements are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use. Please see the project website (https://baskargroup.github.io/Arboretum/) for links to our data, models, and code.
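
As a rough illustration of what image-language pairing from taxonomic annotations can look like, the sketch below builds a text caption from a scientific name, a common name, and a taxonomic lineage, and reads off a label at a chosen rank (as one might when evaluating at different levels of the hierarchy). The `Observation` record, its field names, the caption template, and the file path are all hypothetical choices made for this example; they are not the Arboretum metadata schema or the authors' caption format.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Taxonomic ranks from most general to most specific.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")


@dataclass
class Observation:
    """Hypothetical record layout; NOT the actual Arboretum schema."""
    image_path: str
    scientific_name: str
    common_name: str
    taxonomy: Dict[str, str]  # rank -> taxon name, e.g. {"class": "Aves"}


def make_caption(obs: Observation) -> str:
    """Join the annotations into one text string to pair with the image.

    The template is an assumption made for illustration only.
    """
    lineage = " ".join(obs.taxonomy[r] for r in RANKS if r in obs.taxonomy)
    return (
        f"a photo of {obs.common_name}, {obs.scientific_name}, "
        f"taxonomy: {lineage}"
    )


def label_at_rank(obs: Observation, rank: str) -> Optional[str]:
    """Return the taxon name at the given rank, or None if it is missing."""
    return obs.taxonomy.get(rank)


# Example record (placeholder path; taxonomy of the American robin).
example = Observation(
    image_path="images/000001.jpg",
    scientific_name="Turdus migratorius",
    common_name="American robin",
    taxonomy={
        "kingdom": "Animalia",
        "phylum": "Chordata",
        "class": "Aves",
        "order": "Passeriformes",
        "family": "Turdidae",
        "genus": "Turdus",
    },
)

if __name__ == "__main__":
    print(make_caption(example))
    print(label_at_rank(example, "family"))
```

Keeping the lineage ordered from general to specific means the same record can supply a coarse label (class) or a fine one (genus) without any extra bookkeeping.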
