The number of image datasets collected for environmental monitoring has increased in recent years as computer-vision-assisted methods have gained interest. Computer vision applications rely on high-quality datasets, which makes data curation important. However, data curation is often done ad hoc, and the methods used are rarely published. We present a method for curating large-scale image datasets of invertebrates that contain multiple images of the same taxa and/or specimens and have relatively uniform image backgrounds. Our approach extracts feature embeddings with pretrained deep neural networks and uses them to find the most visually distinct images by comparing each image's embedding to the prototype embedding of its group. We also show that a simple area-based size comparison can find many common erroneous images, such as images containing detached body parts and misclassified samples. In addition to the method, we propose novel metrics for evaluating human-in-the-loop outlier detection methods. The implementations of the proposed curation methods, as well as a benchmark dataset containing annotated erroneous images, are publicly available at https://github.com/mikkoim/taxonomist-studio.
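The two curation signals described above can be illustrated with a minimal sketch. It is not the implementation from the linked repository: the function names (`prototype_distances`, `area_deviation`, `review_order`) are hypothetical, the embeddings are assumed to have been extracted beforehand with a pretrained network, the group prototype is assumed to be the mean of the normalised embeddings, cosine distance is an assumed choice of metric, and the size signal is assumed to be a log-ratio against the group median of a per-image specimen area (for example, a foreground pixel count).

```python
import numpy as np


def prototype_distances(embeddings, groups):
    """Cosine distance of each embedding to its group's prototype.

    Assumption: the prototype is the mean of the L2-normalised embeddings
    of all images sharing a group label (e.g. a taxon or a specimen).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    groups = np.asarray(groups)
    # Normalise rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    dist = np.empty(len(unit))
    for g in np.unique(groups):
        mask = groups == g
        proto = unit[mask].mean(axis=0)
        proto = proto / max(np.linalg.norm(proto), 1e-12)
        dist[mask] = 1.0 - unit[mask] @ proto
    return dist


def area_deviation(areas, groups):
    """Absolute log-ratio of each specimen area to its group's median area.

    Assumption: areas are strictly positive, e.g. foreground pixel counts.
    """
    areas = np.asarray(areas, dtype=float)
    groups = np.asarray(groups)
    dev = np.empty(len(areas))
    for g in np.unique(groups):
        mask = groups == g
        dev[mask] = np.abs(np.log(areas[mask] / np.median(areas[mask])))
    return dev


def review_order(scores):
    """Indices sorted from most to least suspicious, for human review."""
    return np.argsort(-np.asarray(scores))


# Toy data: group "a" holds one visually and physically deviant image (index 3).
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1], [0.0, 1.0],
                  [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0]]
toy_areas = [100, 110, 95, 5, 50, 52, 48]
toy_groups = ["a", "a", "a", "a", "b", "b", "b"]

embedding_ranking = review_order(prototype_distances(toy_embeddings, toy_groups))
area_ranking = review_order(area_deviation(toy_areas, toy_groups))
```

In a human-in-the-loop setting, an annotator would inspect images in the order given by such a ranking and stop once erroneous images become rare. Using the median for the size signal is a design choice made here so that a few extreme areas do not shift the reference value.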