Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

Physical reasoning, which involves interpreting object behaviors within dynamic environments, remains a significant challenge for Vision-Language Models (VLMs). The limitations in physical reasoning arise from an inability to translate learned knowledge into predictions about physical behavior. We perform a careful study to show how continual fine-tuning can mitigate this issue. However, fine-tuning is expensive for large models and impractical to repeatedly perform for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a novel modular framework where specialized VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts for larger VLMs to enhance their reasoning capabilities. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform careful experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes. Our work demonstrates that enhancing visual perception through modular, simulation-trained components offers a practical approach to improving physical reasoning in VLMs, while providing insights into the factors affecting physical understanding in these models.
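The modular split described in the abstract, where a small specialized model produces a physical scene description that a larger model then uses as context, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `build_prompt`, `answer_with_context`, `stub_builder`, and `stub_reasoner` are hypothetical, the prompt layout is invented for the example, and both models are replaced by plain Python stubs. For simplicity the stub reasoner receives only text.

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
# a context builder maps a scene identifier (e.g. an image path) to a text
# description, and a reasoner maps a text prompt to an answer string.
ContextBuilder = Callable[[str], str]
Reasoner = Callable[[str], str]


def build_prompt(physical_context: str, question: str) -> str:
    """Place the generated scene description ahead of the question."""
    return (
        "Physical context:\n"
        f"{physical_context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_with_context(
    scene: str,
    question: str,
    builder: ContextBuilder,
    reasoner: Reasoner,
) -> str:
    """Perception step (builder) followed by reasoning step (reasoner)."""
    context = builder(scene)
    return reasoner(build_prompt(context, question))


# Stand-ins for the two models, used only to make the sketch runnable.
def stub_builder(scene: str) -> str:
    return f"Scene {scene}: three blocks stacked; the top block overhangs the one below."


def stub_reasoner(prompt: str) -> str:
    return "unstable" if "overhangs" in prompt else "stable"


if __name__ == "__main__":
    print(answer_with_context("tower_01.png", "Is the tower stable?",
                              stub_builder, stub_reasoner))
```

Because the builder and the reasoner only meet at the text prompt, either one can be swapped without retraining the other, which is the kind of separation between perception and reasoning that the abstract describes.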
