The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. The approach is tested on a variety of complex emergent systems, ranging from physical systems (waves, light, clouds) to vegetation, cities, materials, and geological formations. By analyzing the models and images the VLMs generate, we examine their understanding of the systems depicted in the images. The results show that leading VLMs (GPT, Gemini) can understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, they show limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine deep, high-level visual understanding of images with limited perception of fine detail.
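The execute-and-compare step of the pipeline can be sketched minimally as follows. This is only an illustration, not the paper's implementation: the VLM call is stubbed out, `vlm_generated_code` stands in for the generative program a real VLM (e.g., GPT or Gemini) would emit for an input image, the `generate` entry point is a hypothetical convention, and mean absolute pixel error is a placeholder for whatever comparison the study actually uses.

```python
# Minimal sketch of the execute-and-compare loop in Im2Sim.
# All interface names here are assumptions, not the paper's API.

def run_im2sim(vlm_generated_code: str, original: list) -> float:
    """Execute VLM-emitted generative code and score its synthetic
    output against the original image (here: a flat list of pixels)."""
    namespace = {}
    # Assumed convention: the generated code defines a generate() function
    # that returns a synthetic image of the same shape as the original.
    exec(vlm_generated_code, namespace)
    synthetic = namespace["generate"]()
    # Mean absolute pixel error as a stand-in comparison metric.
    return sum(abs(a - b) for a, b in zip(original, synthetic)) / len(original)

# Stub "VLM output": a trivial generative program for a 4-pixel image.
code = "def generate():\n    return [0.1, 0.2, 0.3, 0.4]\n"
error = run_im2sim(code, [0.1, 0.2, 0.35, 0.4])
```

In the actual methodology the generated program would be a full simulation (e.g., of wave dynamics or city layout) rendered to an image, and the comparison would operate on that rendering rather than on a toy pixel list.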