We present a new perspective on bridging the generalization gap between biological and computer vision: mimicking the human visual diet. While computer vision models rely on internet-scraped datasets, humans learn from limited 3D scenes in which objects appear in natural context under diverse real-world transformations. Our results demonstrate that incorporating the variations and contextual cues that are ubiquitous in human visual training data (the visual diet) significantly improves generalization to real-world transformations such as lighting, viewpoint, and material changes. This improvement also extends to generalizing from synthetic to real-world data: all models trained with a human-like visual diet outperform specialized architectures by large margins when tested on natural image data. These experiments are enabled by our two key contributions: a novel dataset that captures scene context and diverse real-world transformations to mimic the human visual diet, and a transformer model tailored to leverage these aspects of the human visual diet. All data and source code are available at https://github.com/Spandan-Madan/human_visual_diet.