Efficient Deduplication and Leakage Detection in Large Scale Image Datasets with a focus on the CrowdAI Mapping Challenge Dataset

Recent advancements in deep learning and computer vision have led to widespread use of deep neural networks to extract building footprints from remote-sensing imagery. The success of such methods relies on the availability of large databases of high-resolution remote-sensing images with high-quality annotations. The CrowdAI Mapping Challenge Dataset is one such dataset and has been used extensively in recent years to train deep neural networks. It consists of ~280k training images and ~60k testing images, with polygonal building annotations for all images. However, issues such as low-quality and incorrect annotations, extensive duplication of image samples, and data leakage significantly reduce the utility of deep neural networks trained on the dataset. It is therefore imperative to adopt a data validation pipeline that evaluates the quality of the dataset before it is used. To this end, we propose a drop-in pipeline that employs perceptual hashing techniques for efficient de-duplication of the dataset and for identifying instances of data leakage between the training and testing splits. In our experiments, we demonstrate that nearly 250k (~90%) of the images in the training split were duplicates. Moreover, our analysis of the validation split demonstrates that roughly 56k of its 60k images also appear in the training split, resulting in a data leakage of 93%. The source code used for the analysis and de-duplication of the CrowdAI Mapping Challenge dataset is publicly available at https://github.com/yeshwanth95/CrowdAI_Hash_and_search.
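The hash-and-search idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which perceptual hash is used, so the sketch assumes a simple average hash, and it assumes each image has already been decoded and downscaled to a small grayscale grid (represented here as nested lists). The function names `average_hash`, `group_duplicates`, `deduplicate`, and `leakage` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def average_hash(pixels):
    """Return an average-hash bit string for a 2D grid of grayscale values.

    Assumption: `pixels` is an image already downscaled to a small fixed grid.
    Each bit records whether a pixel is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def group_duplicates(images):
    """Map each hash to the list of image names that share it."""
    groups = defaultdict(list)
    for name, pixels in images.items():
        groups[average_hash(pixels)].append(name)
    return dict(groups)


def deduplicate(images):
    """Keep the first image name seen for every distinct hash."""
    return [names[0] for names in group_duplicates(images).values()]


def leakage(train, test):
    """Return the test image names whose hash also occurs in the training split."""
    train_hashes = {average_hash(pixels) for pixels in train.values()}
    return sorted(
        name for name, pixels in test.items()
        if average_hash(pixels) in train_hashes
    )


# Toy 2x2 "images" standing in for downscaled tiles (made-up values).
tile_a = [[10, 200], [30, 220]]
tile_a_brighter = [[20, 210], [40, 230]]  # same content, uniformly brighter
tile_b = [[200, 10], [220, 30]]

train = {"t0": tile_a, "t1": tile_a_brighter, "t2": tile_b}
test = {"v0": tile_a, "v1": [[5, 5], [250, 250]]}

kept = deduplicate(train)          # one representative per hash
leaked = leakage(train, test)      # test images also present in train
leak_rate = len(leaked) / len(test)
print(kept, leaked, leak_rate)
```

Because hashes are stored in a dictionary and a set, both the duplicate grouping and the train/test lookup take time roughly linear in the number of images, which is what makes this kind of check practical for a dataset with hundreds of thousands of tiles. Matching here is exact on the hash; tolerating small differences would require comparing hashes by Hamming distance instead.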

