Diverse, Difficult, and Odd Instances (D2O): A New Test Set for Object Classification

Test sets are an integral part of evaluating models and gauging progress in object recognition, and more broadly in computer vision and AI. Existing test sets for object recognition, however, suffer from shortcomings such as bias towards ImageNet characteristics and idiosyncrasies (e.g., ImageNet-V2), being limited to certain types of stimuli (e.g., indoor scenes in ObjectNet), and underestimating model performance (e.g., ImageNet-A). To mitigate these problems, we introduce a new test set, called D2O, which is sufficiently different from existing test sets. Its images are a mix of generated images and images crawled from the web. They are diverse, unmodified, and representative of real-world scenarios, and they cause state-of-the-art models to misclassify them with high confidence. To emphasize generalization, our dataset by design does not come paired with a training set. It contains 8,060 images spread across 36 categories, of which 29 appear in ImageNet. The best Top-1 accuracy on our dataset is around 60%, which is much lower than the best Top-1 accuracy of 91% on ImageNet. We find that popular vision APIs perform very poorly in detecting objects over D2O categories such as "faces", "cars", and "cats". Our dataset also comes with a "miscellaneous" category, over which we test image tagging models. Overall, our investigations demonstrate that the D2O test set contains a mix of images with varied levels of difficulty and is predictive of the average-case performance of models. It can challenge object recognition models for years to come and can spur more research in this fundamental area.
