In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert and has associated genetic information, including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically based proxies for species classification. This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment; however, the dataset also presents compelling characteristics whose study would be of interest to the broader machine learning community. Owing to the biological nature of the data, the dataset exhibits a characteristic long-tailed class-imbalance distribution. Furthermore, taxonomic labelling is a hierarchical classification scheme, presenting a highly fine-grained classification problem at lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity. This paper introduces the dataset and explores the classification task through the implementation and analysis of a baseline classifier.