Modern vision models typically rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models capture only the knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet, where billions of images are uploaded each day. We suggest an alternative approach: rather than hoping that static datasets transfer to the desired task after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on the downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance while using just a single-GPU desktop to actively query the Internet for 30--40 hours. Results, visualizations, and videos are available at https://internet-explorer-ssl.github.io/
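The four-step cycle described in the abstract (search, self-supervised training, scoring usefulness, prioritizing the next queries) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the query vocabulary, the `HIDDEN_RELEVANCE` table, the stub functions `search_images`, `train_self_supervised`, and `usefulness`, and the softmax-over-running-averages prioritization are all hypothetical stand-ins chosen so the example runs offline.

```python
import math
import random

# Hypothetical query vocabulary with a hidden "relevance" to the target dataset.
# A real system would not know these values; it would have to estimate them.
HIDDEN_RELEVANCE = {
    "tabby cat": 0.9,
    "siamese cat": 0.8,
    "sports car": 0.1,
    "oak tree": 0.05,
}


def search_images(query, n, rng):
    """Stub for an image search: returns n fake images tagged with their query."""
    return [{"query": query, "noise": rng.gauss(0.0, 0.05)} for _ in range(n)]


def train_self_supervised(model_state, images):
    """Stub for a self-supervised update; here it only counts the images seen."""
    return model_state + len(images)


def usefulness(image):
    """Stub reward: a noisy reading of how relevant an image is to the target."""
    return HIDDEN_RELEVANCE[image["query"]] + image["noise"]


def explore(num_rounds=30, queries_per_round=2, images_per_query=5,
            temperature=0.1, seed=0):
    rng = random.Random(seed)
    concepts = list(HIDDEN_RELEVANCE)
    estimate = {c: 0.5 for c in concepts}  # prior guess of each query's usefulness
    counts = {c: 0 for c in concepts}      # how often each query was issued
    model_state = 0

    for _ in range(num_rounds):
        # Prioritize: sample queries with probability given by a softmax
        # over the current usefulness estimates.
        weights = [math.exp(estimate[c] / temperature) for c in concepts]
        chosen = rng.choices(concepts, weights=weights, k=queries_per_round)

        for query in chosen:
            # Search, then train on whatever was downloaded.
            images = search_images(query, images_per_query, rng)
            model_state = train_self_supervised(model_state, images)

            # Score the batch and fold it into a running average for this query.
            reward = sum(usefulness(img) for img in images) / len(images)
            counts[query] += 1
            estimate[query] += (reward - estimate[query]) / counts[query]

    return estimate, counts, model_state


if __name__ == "__main__":
    estimate, counts, model_state = explore()
    print("queries issued:", counts)
    print("images trained on:", model_state)
```

In this toy setting the loop concentrates its search budget on the queries whose downloads score as useful, which is the behavior the abstract describes at a high level. How the actual system represents queries, measures usefulness, and trains its model is not specified in the abstract.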