Insect Identification in the Wild: The AMI Dataset

Aditya Jain, Fagner Cunha, Michael James Bunsen, Juan Sebastián Cañas, Léonard Pasi, Nathan Pinoy, Flemming Helsing, JoAnne Russo, Marc Botham, Michael Sabourin, Jonathan Fréchette, Alexandre Anctil, Yacksecari Lopez, Eduardo Navarro, Filonila Perez Pimentel, Ana Cecilia Zamora, José Alejandro Ramirez Silva, Jonathan Gagnon, Tom August, Kim Bjerge, Alba Gomez Segura, Marc Bélisle, Yves Basset, Kent P. McFarland, David Roy, Toke Thomas Høye, Maxim Larrivée, David Rolnick

Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets are made publicly available.

