The role of data in building AI systems has recently been emphasized by the emerging concept of data-centric AI. Unfortunately, real-world datasets may contain dirty samples, such as poisoned samples from backdoor attacks, noisy labels from crowdsourcing, and even hybrids of the two. The presence of such dirty samples makes DNNs vulnerable and unreliable. Hence, it is critical to detect dirty samples to improve the quality and reliability of datasets. Existing detectors focus on either poisoned samples or noisy labels, and often generalize poorly to dirty samples from other domains. In this paper, we find that a commonality of various dirty samples is the visual-linguistic inconsistency between images and their associated labels. To capture this semantic inconsistency between modalities, we propose the versatile data cleanser (VDC), which leverages the strong capabilities of multimodal large language models (MLLMs) in cross-modal alignment and reasoning. It consists of three consecutive modules: the visual question generation module, which generates insightful questions about the image; the visual question answering module, which acquires the semantics of the visual content by answering the questions with an MLLM; and the visual answer evaluation module, which evaluates the inconsistency. Extensive experiments demonstrate its superior performance and generalization to various categories and types of dirty samples.
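The three-module pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the template-based questions, the exact-match scoring rule, the 0.5 threshold, and the `toy_mllm` stand-in, which replaces a real MLLM with a lookup on a dictionary that plays the role of an image.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an MLLM: takes (image, question), returns a short answer.
AnswerFn = Callable[[object, str], str]


def generate_questions(label: str) -> List[Tuple[str, str]]:
    """Question generation stand-in: template questions, each paired with
    the answer we would expect if the label matched the image."""
    return [
        (f"Is the main object in the image a {label}?", "yes"),
        (f"Does the image show something other than a {label}?", "no"),
    ]


def answer_questions(image: object,
                     questions: List[Tuple[str, str]],
                     answer_fn: AnswerFn) -> List[str]:
    """Question answering stand-in: ask the model each question about the image."""
    return [answer_fn(image, q).strip().lower() for q, _ in questions]


def inconsistency_score(questions: List[Tuple[str, str]],
                        answers: List[str]) -> float:
    """Answer evaluation stand-in: fraction of answers that contradict
    the answer expected under the given label."""
    mismatches = sum(1 for (_, expected), got in zip(questions, answers)
                     if got != expected)
    return mismatches / len(questions)


def is_dirty(image: object, label: str, answer_fn: AnswerFn,
             threshold: float = 0.5) -> bool:
    """Flag a sample when the image and label disagree often enough.
    The threshold value is an illustrative assumption."""
    questions = generate_questions(label)
    answers = answer_questions(image, questions, answer_fn)
    return inconsistency_score(questions, answers) >= threshold


def toy_mllm(image: dict, question: str) -> str:
    """Toy replacement for a real MLLM: the 'image' is a dict whose
    'content' field names what it shows."""
    mentions_content = image["content"] in question
    if question.startswith("Is the main object"):
        return "yes" if mentions_content else "no"
    return "no" if mentions_content else "yes"


cat_image = {"content": "cat"}
clean_flag = is_dirty(cat_image, "cat", toy_mllm)  # label agrees with image
dirty_flag = is_dirty(cat_image, "dog", toy_mllm)  # label contradicts image
```

In this sketch the detector never looks at how the label became wrong, only at whether the image and label agree, which is why the same check could in principle cover both poisoned samples and noisy labels. A real system would replace `toy_mllm` with calls to an actual multimodal model and would likely need a softer answer comparison than exact string matching.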