A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset

Zahra Gharaee, ZeMing Gong, Nicholas Pellegrino, Iuliia Zarubiieva, Joakim Bruslund Haurum, Scott C. Lowe, Jaclyn T. A. McKeown, Chris C. Y. Ho, Joschka McLeod, Yi-Yun C Wei, Jireh Agda, Sujeevan Ratnasingham, Dirk Steinke, Angel X. Chang, Graham W. Taylor, Paul Fieguth

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert and also has associated genetic information, including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically based proxies for species classification. This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment; however, the dataset also has compelling characteristics whose study would be of interest to the broader machine learning community. Because of its biological origin, the dataset exhibits a characteristic long-tailed class-imbalance distribution. Furthermore, taxonomic labelling is a hierarchical classification scheme, which presents a highly fine-grained classification problem at the lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity. This paper introduces the dataset and explores the classification task through the implementation and analysis of a baseline classifier.
