The success of NLP systems often relies on the availability of large, high-quality datasets. However, not all samples in these datasets are equally valuable for learning, as some may be redundant or noisy. Several methods for characterizing datasets based on model-driven meta-information (e.g., a model's confidence) have been developed, but the relationships among these methods and their complementary effects have received less attention. In this paper, we introduce infoVerse, a universal framework for dataset characterization, which provides a new feature space that effectively captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. infoVerse reveals distinctive regions of the dataset that are not apparent in the original semantic space, hence guiding users (or models) in identifying which samples to focus on for exploration, assessment, or annotation. Additionally, we propose a novel sampling method on infoVerse to select a set of data points that maximizes informativeness. In three real-world applications (data pruning, active learning, and data annotation), samples chosen in the infoVerse space consistently outperform strong baselines. Our code and demo are publicly available.