This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter measures how challenging a test set is for a learned model. Central to both tasks is measuring the underlying relationship between datasets. This requires a dataset vectorization scheme that preserves as much discriminative dataset information as possible, so that the distance between the resulting dataset vectors reflects dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag of patch descriptors to a dataset-level bag of semantic prototypes. Specifically, we build a codebook of K prototypes obtained by clustering the features of a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a prototype in the codebook and obtain a K-dimensional histogram. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset's semantic distribution. Furthermore, BoP representations pair well with the Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Although very simple, BoP consistently shows an advantage over existing representations on a series of benchmarks for the two dataset-level tasks.
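The pipeline described in the abstract (codebook of K prototypes, quantization into a K-dimensional histogram, Jensen-Shannon divergence between histograms) can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' stated method: the abstract does not name the clustering algorithm, so plain k-means is used; the histogram is normalized to sum to 1 so it can be treated as a probability distribution; nearest-prototype assignment uses Euclidean distance; and the function names `build_codebook`, `bop_histogram`, and `js_divergence` are hypothetical. Image features are assumed to be precomputed vectors from some feature extractor.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(feature, codebook):
    """Index of the prototype closest to `feature` (Euclidean; an assumption)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def build_codebook(reference_features, K, iters=20, seed=0):
    """Cluster reference features into K prototypes.

    Plain k-means is an assumed choice; the abstract only says the
    prototypes are clustered from a reference dataset.
    """
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(reference_features, K)]
    for _ in range(iters):
        groups = [[] for _ in range(K)]
        for f in reference_features:
            groups[nearest(f, codebook)].append(f)
        for k, members in enumerate(groups):
            if members:  # keep the previous prototype if a cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def bop_histogram(features, codebook):
    """Quantize each feature to its nearest prototype; return a normalized K-dim histogram."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest(f, codebook)] += 1
    total = len(features)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In use, one would call `build_codebook` once on the reference dataset's features, encode every other dataset with `bop_histogram` against that shared codebook, and compare two datasets with `js_divergence` on their histograms. Identical histograms give a divergence of 0, and histograms with disjoint support give 1.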