The growing volume of waste is an environmental problem that calls for efficient techniques for sorting different kinds of waste. Automated waste classification systems built on Artificial Intelligence (AI) models are used for this purpose. The effectiveness of these models depends on the quality and accessibility of publicly available datasets, which provide the basis for training and analyzing classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or to train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile the GWD archive by merging multiple publicly available datasets into a single, unified resource. The archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps, such as quality filtering, duplicate removal, and metadata generation, further improve dataset reliability. Overall, the dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and it is publicly available to promote future research and reproducibility.
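As an illustration of the kind of merging described above, the following minimal sketch maps source-specific labels onto a unified (category, subclass) scheme, drops byte-identical duplicates, and emits one metadata record per image. It is not the GWD pipeline itself: the label names in `LABEL_MAP`, the record layout, and the function names are all hypothetical, and hashing the raw bytes with SHA-256 is an assumed stand-in for whatever duplicate detection was actually used (it will not catch resized or re-encoded copies).

```python
import hashlib
from collections import Counter

# Hypothetical mapping from source-specific label names to a unified
# (main category, subclass) pair. The real GWD taxonomy is not shown here.
LABEL_MAP = {
    "plastic_bottle": ("plastic", "bottle"),
    "PET bottle": ("plastic", "bottle"),
    "cardboard": ("paper", "cardboard"),
    "Glass-Jar": ("glass", "jar"),
}


def harmonize(label):
    """Return the unified (category, subclass) pair, or None if unmapped."""
    return LABEL_MAP.get(label.strip())


def merge_records(records):
    """Merge (source, label, image_bytes) tuples into unified metadata records.

    Records with unmapped labels are skipped, and only the first copy of
    each byte-identical image is kept.
    """
    seen = set()
    merged = []
    for source, label, data in records:
        unified = harmonize(label)
        if unified is None:
            continue  # label has no entry in the unified scheme
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an image already kept
        seen.add(digest)
        category, subclass = unified
        merged.append(
            {
                "source": source,
                "category": category,
                "subclass": subclass,
                "sha256": digest,
            }
        )
    return merged


def class_distribution(merged):
    """Count merged images per main category to inspect class balance."""
    return Counter(record["category"] for record in merged)
```

Calling `merge_records` on tuples drawn from several source datasets and then `class_distribution` on the result gives a quick view of how balanced the merged categories are before any resampling is considered.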