Diagnosing and cleaning data is a crucial step in building robust machine learning systems. However, identifying problems in large-scale datasets with real-world distributions is challenging because of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying problematic data that exploits a largely ignored source of information: the relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of the data. We further introduce a visualization tool that provides contextual information about a data point in the feature-embedded space, which supports interactive data diagnosis. We evaluate the label error and outlier/out-of-distribution (OOD) detection performance of our approach on large-scale tasks in the image, speech, and language domains, including ImageNet, ESC-50, and MNLI. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains.
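To make the idea of using relational structure in the feature-embedded space concrete, the sketch below builds a k-nearest-neighbor graph over embeddings and derives two simple scores from it. This is a generic illustration, not the algorithm proposed in the paper: the function names (`knn_graph`, `label_error_score`, `outlier_score`), the use of cosine similarity, the choice of `k`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def knn_graph(emb, k):
    """Return indices and cosine similarities of each point's k nearest neighbors."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]
    return idx, np.take_along_axis(sim, idx, axis=1)


def label_error_score(labels, idx):
    """Fraction of neighbors whose label disagrees with the point's own label."""
    return (labels[idx] != labels[:, None]).mean(axis=1)


def outlier_score(nbr_sim):
    """One minus the mean similarity to the nearest neighbors."""
    return 1.0 - nbr_sim.mean(axis=1)


# Synthetic data (hypothetical): two tight clusters, one flipped label, one outlier.
rng = np.random.default_rng(0)
cluster_a = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((10, 2))
cluster_b = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((10, 2))
far_point = np.array([[-1.0, -1.0]])
emb = np.vstack([cluster_a, cluster_b, far_point])

labels = np.array([0] * 10 + [1] * 10 + [0])
labels[3] = 1  # inject a label error into cluster A

idx, nbr_sim = knn_graph(emb, k=5)
label_scores = label_error_score(labels, idx)
outlier_scores = outlier_score(nbr_sim)

print("most suspicious label:", int(np.argmax(label_scores)))
print("most outlying point:", int(np.argmax(outlier_scores)))
```

In this toy setting the point with the flipped label disagrees with all of its neighbors, and the far point has low similarity to everything else, so each ranks first under its respective score. The dense similarity matrix used here costs quadratic memory, so a dataset at the scale of ImageNet would need an approximate nearest-neighbor index instead.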