Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multimodal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of human-level understanding of physical principles. Existing datasets for physical reasoning rely either on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, i.e., visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a meta-information-guided instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a prompting strategy that uses gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark of 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.