ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle yet important difference in two plausible causal data-generating processes for the respective datasets, which we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of their class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts.
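The intra-class similarity comparison central to the abstract can be made concrete. Below is a minimal sketch, assuming each class's images have already been mapped to embedding vectors (e.g., by an image encoder such as CLIP): it computes the mean pairwise cosine similarity within a class. The function name and NumPy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def intra_class_similarity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity among one class's image embeddings.

    embeddings: array of shape (n_images, embedding_dim).
    """
    # Normalize each embedding to unit length so dot products are cosines.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # all pairwise cosine similarities
    n = embeddings.shape[0]
    # Average over distinct pairs only (exclude the diagonal of self-similarities).
    return float((sims.sum() - n) / (n * (n - 1)))

# Sanity check: identical images are maximally similar within their class.
identical = np.ones((4, 8))
print(round(intra_class_similarity(identical), 4))  # 1.0
```

Under this metric, the abstract's claim is that ImageNet classes score markedly higher than the corresponding LAIONet classes, consistent with ImageNet images being more visually stereotypical.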