Diagnosing and cleaning data is a crucial step in building robust machine learning systems. However, identifying problems in large-scale datasets with real-world distributions is challenging because of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying problematic data that uses a largely ignored source of information: the relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of data. We further introduce a visualization tool that provides contextual information about a data point in the feature-embedded space, serving as an effective tool for interactively diagnosing data. We evaluate the label error and outlier/out-of-distribution (OOD) detection performance of our approach on large-scale image, speech, and language tasks, including ImageNet, ESC-50, and SST2. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains. Our code is available at https://github.com/snu-mllab/Neural-Relation-Graph.