Image classifiers are information-discarding machines by design, yet how these models discard information remains mysterious. We hypothesize that one way for image classifiers to reach high accuracy is to first zoom in on the most discriminative region of the image and then extract features from that region to predict the image label, discarding the rest of the image. Studying six popular networks ranging from AlexNet to CLIP, we find that proper framing of the input image can lead to the correct classification of 98.91% of ImageNet images. Furthermore, we uncover positional biases in various datasets, especially a strong center bias in two popular datasets: ImageNet-A and ObjectNet. Finally, leveraging our insights into the potential of zooming, we propose a test-time augmentation (TTA) technique that improves classification accuracy by forcing models to explicitly perform zoom-in operations before making predictions. Our method is more interpretable, more accurate, and faster than MEMO, a state-of-the-art (SOTA) TTA method. We also introduce ImageNet-Hard, a new benchmark that challenges SOTA classifiers, including large vision-language models, even when optimal zooming is allowed.
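To make the idea of "proper framing" concrete, the following is a minimal sketch of how a set of zoomed views might be enumerated and checked. It is an illustration only, not the procedure used in this work: the helper names `zoom_boxes` and `any_view_correct`, the scale values, and the 3x3 anchor grid are all assumptions introduced here. An image is counted as classifiable under zooming if at least one view yields the correct label, which is one plausible reading of the upper bound quoted above.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def zoom_boxes(width: int, height: int,
               scales: Sequence[float] = (1.0, 0.75, 0.5),
               grid: int = 3) -> List[Box]:
    """Enumerate crop boxes at several zoom scales and anchor positions.

    Hypothetical helper: scale 1.0 is the full image, smaller scales zoom in.
    Anchors lie on a grid x grid lattice that keeps every crop in bounds.
    """
    boxes: List[Box] = []
    for s in scales:
        if not 0.0 < s <= 1.0:
            raise ValueError("scale must be in (0, 1]")
        cw, ch = max(1, round(width * s)), max(1, round(height * s))
        for gy in range(grid):
            for gx in range(grid):
                # Slide the crop window across the free space on each axis.
                fx = gx / (grid - 1) if grid > 1 else 0.5
                fy = gy / (grid - 1) if grid > 1 else 0.5
                left = round((width - cw) * fx)
                top = round((height - ch) * fy)
                box = (left, top, left + cw, top + ch)
                if box not in boxes:  # e.g. scale 1.0 repeats the same box
                    boxes.append(box)
    return boxes


def any_view_correct(predict: Callable[[Box], int],
                     boxes: Iterable[Box], label: int) -> bool:
    """Return True if the label is recovered from at least one zoomed view."""
    return any(predict(b) == label for b in boxes)
```

In practice, `predict` would crop the image to the given box (for instance with Pillow's `Image.crop`), resize the crop to the classifier's input size, and return the predicted class index. A test-time augmentation variant would aggregate the predictions over all views rather than asking whether any single view is correct.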