Physical reasoning, which involves interpreting object behavior in dynamic environments, remains a significant challenge for Vision-Language Models (VLMs). These limitations arise from an inability to translate learned knowledge into predictions about physical behavior. We present a careful study showing that continual fine-tuning can mitigate this issue. However, fine-tuning is expensive for large models and impractical to repeat for every task, which motivates modular and scalable ways to teach VLMs physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a novel modular framework in which specialized VLMs are fine-tuned to generate detailed physical scene descriptions. These descriptions serve as physical context for larger VLMs, enhancing their reasoning capabilities. PCBs separate visual perception from reasoning, allowing us to analyze the relative contribution of each to physical understanding. In experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs show strong Sim2Real transfer, generalizing from simulated training data to real-world scenes. Our work demonstrates that enhancing visual perception through modular, simulation-trained components offers a practical approach to improving physical reasoning in VLMs, while providing insights into the factors that affect physical understanding in these models.
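The modular flow described in the abstract, where a small fine-tuned model writes a physical scene description and a larger model reasons over it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `PCBPipeline`, the two callable interfaces, the prompt template, and the stub models are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (not from the paper): each model is a plain callable.
SceneDescriber = Callable[[List[str]], str]   # stands in for a fine-tuned PCB
Reasoner = Callable[[str, List[str]], str]    # stands in for a larger, frozen VLM


@dataclass
class PCBPipeline:
    """Perception (scene description) is kept separate from reasoning."""
    describe_scene: SceneDescriber
    answer: Reasoner

    def build_prompt(self, question: str, context: str) -> str:
        # Assumed prompt template; the actual format is not given in the abstract.
        return f"Physical context:\n{context}\n\nQuestion: {question}"

    def run(self, frames: List[str], question: str) -> str:
        context = self.describe_scene(frames)          # step 1: perception
        prompt = self.build_prompt(question, context)  # step 2: inject context
        return self.answer(prompt, frames)             # step 3: reasoning


# Placeholder models so the sketch runs without any real VLM.
def stub_pcb(frames: List[str]) -> str:
    return f"{len(frames)} frames; top block is offset from the stack center"


def stub_vlm(prompt: str, frames: List[str]) -> str:
    return "unstable" if "offset" in prompt else "stable"


pipeline = PCBPipeline(describe_scene=stub_pcb, answer=stub_vlm)
result = pipeline.run(frames=["frame0", "frame1"], question="Will the tower fall?")
print(result)
```

Because the two stages only share a text description, either callable could be swapped out independently, which is the property that lets perception and reasoning be studied separately.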