Modern vision models typically rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only capture the knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet, where billions of images are uploaded each day. We suggest an alternative approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30--40 hours. Results, visualizations, and videos are available at https://internet-explorer-ssl.github.io/
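The four-step cycle described above (search, train, score, prioritize) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `explore`, the softmax sampling rule, the running-mean reward estimate, and the stand-ins `fake_search` and `fake_reward` are all hypothetical. No network access or model training takes place; an "image" here is simply a number standing for its relevance to the target dataset.

```python
import math
import random


def explore(concepts, search, train, reward, rounds=20, per_round=3,
            temperature=0.1, seed=0):
    """Cycle: pick queries -> search -> train -> score -> reprioritize."""
    rng = random.Random(seed)
    estimates = {c: 0.0 for c in concepts}  # running mean reward per query
    counts = {c: 0 for c in concepts}       # times each query was issued
    for _ in range(rounds):
        # Prioritize: sample queries with probability proportional to
        # exp(estimate / temperature), a softmax over estimated usefulness.
        # (Assumed rule for illustration only.)
        weights = [math.exp(estimates[c] / temperature) for c in concepts]
        for query in rng.choices(concepts, weights=weights, k=per_round):
            images = search(query)   # 1. search with a text query
            train(images)            # 2. self-supervised update (stubbed)
            r = reward(images)       # 3. score how useful the images were
            counts[query] += 1
            # 4. update the running mean used for the next round's sampling
            estimates[query] += (r - estimates[query]) / counts[query]
    return estimates, counts


# Toy stand-ins (all hypothetical): each query has a fixed hidden relevance.
HIDDEN_RELEVANCE = {"sunflower": 0.9, "daisy": 0.7, "tractor": 0.2, "laptop": 0.1}
seen = []  # plays the role of the growing training set


def fake_search(query, n=4):
    # Return n "images", each represented only by its relevance score.
    return [HIDDEN_RELEVANCE[query]] * n


def fake_reward(images):
    # Mean relevance of the downloaded batch.
    return sum(images) / len(images)


estimates, counts = explore(list(HIDDEN_RELEVANCE), fake_search,
                            seen.extend, fake_reward)
```

In this sketch, queries that returned useful images earlier are sampled more often in later rounds, which is the sense in which the loop "prioritizes what to search for next". A real system would replace the stand-ins with an image search client, a self-supervised training step, and a relevance measure computed against the target dataset.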