The Street View House Numbers (SVHN) dataset is a popular benchmark in deep learning. Originally designed for digit classification, SVHN has since been widely used as a benchmark for other tasks, including generative modeling. With this work, we warn the community about an issue with SVHN as a benchmark for generative modeling: we discover that the official training set and test set of SVHN are not drawn from the same distribution. We show empirically that this distribution mismatch has little impact on classification (which may explain why the issue has not been detected before), but that it severely affects the evaluation of probabilistic generative models such as Variational Autoencoders and diffusion models. As a workaround, we propose to mix and re-split the official training and test sets when SVHN is used for tasks other than classification. We publish a new split, together with the indices used to create it, at https://jzenn.github.io/svhn-remix/.
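The mix-and-re-split workaround can be illustrated with a minimal sketch. It is not the published split: the function name `remix_split`, the fixed `seed`, and the choice to keep the original training and test set sizes are assumptions made for illustration only. To reproduce reported results, the indices published at the URL above should be used instead.

```python
import numpy as np


def remix_split(x_train, y_train, x_test, y_test, seed=0):
    """Pool the official splits, shuffle once, and re-split.

    Assumption: the new training set keeps the size of the official
    training set; the remainder becomes the new test set.
    """
    n_train = len(x_train)

    # Pool images and labels from both official splits.
    x_all = np.concatenate([x_train, x_test], axis=0)
    y_all = np.concatenate([y_train, y_test], axis=0)

    # One fixed permutation so the split can be regenerated and shared.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x_all))
    train_idx, test_idx = perm[:n_train], perm[n_train:]

    new_train = (x_all[train_idx], y_all[train_idx])
    new_test = (x_all[test_idx], y_all[test_idx])
    # Indices refer to positions in the pooled array (train first, then test).
    return new_train, new_test, (train_idx, test_idx)
```

Because both new splits are sampled from the same pooled data, they follow the same distribution by construction. Returning the index arrays alongside the data allows the split to be stored and shared rather than regenerated.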