GlobalWasteData: A Large-Scale, Integrated Dataset for Robust Waste Classification and Environmental Monitoring

The growing volume of waste is an environmental problem that calls for efficient techniques for sorting different kinds of waste. Automated waste classification systems based on Artificial Intelligence (AI) are used for this purpose. The effectiveness of these AI models depends on the quality and accessibility of publicly available datasets, which provide the basis for training and analyzing classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or to train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile this integrated archive by merging multiple publicly available datasets into a single, unified resource. The GWD archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps, such as quality filtering, duplicate removal, and metadata generation, further improve dataset reliability. Overall, the dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and it is publicly available to promote future research and reproducibility.
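The merging steps named in the abstract (label unification, duplicate removal, metadata generation) can be illustrated with a minimal sketch. It is not the authors' pipeline: the dataset names, the `LABEL_MAP` entries, the `Record` fields, and the `merge` function are all hypothetical, and duplicates are detected only as byte-identical files via SHA-256.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical mapping from (source dataset, original label) to a unified
# (main category, subclass) pair. All names here are illustrative only.
LABEL_MAP = {
    ("dataset_a", "plastic bottle"): ("plastic", "plastic_bottle"),
    ("dataset_b", "PET_bottle"): ("plastic", "plastic_bottle"),
    ("dataset_b", "newspaper"): ("paper", "newspaper"),
}


@dataclass(frozen=True)
class Record:
    """Minimal per-image metadata kept after merging (assumed fields)."""
    source: str
    category: str
    subclass: str
    sha256: str


def merge(samples):
    """Merge (source, original_label, image_bytes) triples into one list.

    Labels are mapped to the unified taxonomy, samples with unmapped
    labels are skipped, and byte-identical images are kept only once.
    """
    seen = set()
    merged = []
    for source, label, data in samples:
        unified = LABEL_MAP.get((source, label))
        if unified is None:
            continue  # label not covered by the unified taxonomy
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an image already kept
        seen.add(digest)
        merged.append(Record(source, unified[0], unified[1], digest))
    return merged


samples = [
    ("dataset_a", "plastic bottle", b"img1"),
    ("dataset_b", "PET_bottle", b"img1"),   # same bytes as the first image
    ("dataset_b", "newspaper", b"img2"),
    ("dataset_b", "unknown", b"img3"),      # unmapped label
]
records = merge(samples)
```

Exact hashing catches only identical files. Resized or re-encoded copies would need a perceptual similarity measure, which this sketch does not attempt.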

