BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang

As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens. It significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task, applied to images and DNA barcodes, in which feature embeddings obtained from self-supervised learning are clustered to investigate whether meaningful clusters can be derived from these representations. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space that enables taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at https://github.com/zahrag/BIOSCAN-5M.
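To make the first benchmark more concrete, the following is a minimal sketch of how a DNA barcode might be prepared for masked language model pretraining. It is not the authors' code: the non-overlapping k-mer tokenization, the value `k=4`, the `[MASK]` token name, the masking ratio, and the helper names `kmer_tokenize` and `mask_tokens` are all assumptions chosen for illustration. The actual preprocessing is defined in the repository linked above.

```python
import random

MASK = "[MASK]"  # hypothetical mask token name, assumed for this sketch


def kmer_tokenize(seq, k=4):
    """Split a nucleotide string into non-overlapping k-mers.

    The final token may be shorter than k if len(seq) is not a multiple of k.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def mask_tokens(tokens, mask_ratio=0.5, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets


if __name__ == "__main__":
    tokens = kmer_tokenize("acgtacgtac")
    masked, targets = mask_tokens(tokens)
    print(tokens)
    print(masked)
    print(targets)
```

During pretraining, a model would receive the masked token list as input and be trained to recover the entries of `targets`; the learned encoder could then be reused for species- and genus-level classification.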

