A Bag-of-Prototypes Representation for Dataset-Level Applications

This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter studies how challenging a test set is for a learned model. Central to both tasks is measuring the underlying relationship between datasets. This calls for a dataset vectorization scheme that preserves as much discriminative dataset information as possible, so that the distance between the resulting dataset vectors reflects dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag of patch descriptors to a dataset-level bag of semantic prototypes. Specifically, we build a codebook of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a prototype in the codebook and obtain a K-dimensional histogram. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset's semantic distribution. Furthermore, BoP representations work well with Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Although very simple, BoP consistently outperforms existing representations on a series of benchmarks for the two dataset-level tasks.
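The pipeline described in the abstract (cluster a reference dataset into K prototypes, quantize each feature of a target dataset to a prototype, build a K-dimensional histogram, compare histograms with Jensen-Shannon divergence) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the plain Lloyd's k-means, the Euclidean nearest-prototype assignment, and the L1 normalization of the histogram are all assumptions, since the abstract does not specify the clustering algorithm, the feature extractor, or the normalization.

```python
import numpy as np


def quantize(features, prototypes):
    """Return, for each feature, the index of its nearest prototype.

    Nearest is measured by squared Euclidean distance (an assumption).
    """
    diff = features[:, None, :] - prototypes[None, :, :]
    return (diff ** 2).sum(axis=-1).argmin(axis=1)


def build_codebook(ref_features, k, n_iter=20, seed=0):
    """Cluster reference features into k prototypes (hypothetical helper).

    Uses plain Lloyd's k-means as a stand-in for whatever clustering
    the original work uses.
    """
    rng = np.random.default_rng(seed)
    # Initialise prototypes from k distinct reference features.
    idx = rng.choice(len(ref_features), size=k, replace=False)
    prototypes = ref_features[idx].astype(float)
    for _ in range(n_iter):
        assign = quantize(ref_features, prototypes)
        for j in range(k):
            members = ref_features[assign == j]
            if len(members) > 0:  # keep the old prototype if a cluster is empty
                prototypes[j] = members.mean(axis=0)
    return prototypes


def bop_histogram(features, prototypes):
    """Encode a dataset as a K-dimensional histogram over prototypes.

    The histogram is normalised to sum to 1 (an assumption) so that it
    can be treated as a probability distribution.
    """
    counts = np.bincount(quantize(features, prototypes),
                         minlength=len(prototypes))
    return counts / counts.sum()


def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms (natural log)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In use, the features would come from a pretrained image encoder applied to every image in a dataset. The codebook is built once from the reference dataset, after which any number of datasets can be encoded against it and compared pairwise with `js_divergence`. With the natural logarithm, the divergence lies between 0 (identical histograms) and ln 2.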

