As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens. It significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the \mbox{BIOSCAN-5M} dataset and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task in which feature embeddings of images and DNA barcodes, obtained from self-supervised learning, are clustered to investigate whether meaningful clusters can be derived from these representations. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space that enables taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at {\url{https://github.com/zahrag/BIOSCAN-5M}}.