infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information

The success of NLP systems often relies on the availability of large, high-quality datasets. However, not all samples in these datasets are equally valuable for learning, as some may be redundant or noisy. Several methods for characterizing datasets based on model-driven meta-information (e.g., the model's confidence) have been developed, but the relationships among these methods and their complementary effects have received less attention. In this paper, we introduce infoVerse, a universal framework for dataset characterization, which provides a new feature space that effectively captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. infoVerse reveals distinctive regions of the dataset that are not apparent in the original semantic space, hence guiding users (or models) in identifying which samples to focus on for exploration, assessment, or annotation. Additionally, we propose a novel sampling method on infoVerse to select a set of data points that maximizes informativeness. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen in the infoVerse space consistently outperform strong baselines. Our code and demo are publicly available.
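As a rough illustration of what a feature space built from model-driven meta-information might look like, the sketch below maps each sample's predicted class probabilities to a small vector and then picks a spread-out subset in that space. The abstract does not say which meta-information is used or how the sampling method works, so everything here is an assumption: the three signals (gold-label confidence, predictive entropy, top-two margin), the function names `meta_features` and `farthest_point_selection`, and the greedy farthest-point selection, which is a generic diversity heuristic standing in for the paper's sampler, not a reproduction of it.

```python
import math


def meta_features(probs, label):
    """Map one sample's class probabilities to a meta-information vector.

    The three signals are assumed for illustration; the abstract only
    names the model's confidence as an example.
    """
    confidence = probs[label]  # probability assigned to the gold label
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    top_two = sorted(probs, reverse=True)[:2]
    margin = top_two[0] - top_two[1]  # gap between the two most likely classes
    return (confidence, entropy, margin)


def farthest_point_selection(features, k):
    """Greedily pick up to k indices that are far apart in feature space.

    Hypothetical stand-in for a diversity-seeking sampler: start from
    index 0, then repeatedly add the point farthest from the chosen set.
    """
    if not features or k <= 0:
        return []
    selected = [0]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < min(k, len(features)):
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [
            min(d, math.dist(f, features[nxt]))
            for d, f in zip(min_dist, features)
        ]
    return selected


# Toy predictions: (class probabilities, gold label). Values are made up.
predictions = [
    ([0.90, 0.10], 0),  # confident and correct
    ([0.88, 0.12], 0),  # near-duplicate of the first sample
    ([0.50, 0.50], 1),  # ambiguous
    ([0.10, 0.90], 0),  # confident but wrong (possibly mislabeled)
]
features = [meta_features(p, y) for p, y in predictions]
selected = farthest_point_selection(features, 3)
```

On this toy input the near-duplicate second sample is the one left out, which is the kind of redundancy the abstract says a characterization of this sort should help expose.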

