Discovering physical laws directly from high-dimensional visual data is a long-standing human pursuit but remains a formidable challenge for machines, representing a fundamental goal of scientific intelligence. The task is inherently difficult because physical knowledge is low-dimensional and structured, whereas raw video observations are high-dimensional and redundant, with most pixels carrying little or no physical meaning. Extracting concise, physically relevant variables from such noisy data remains a key obstacle. To address this, we propose Pixel2Phys, a collaborative multi-agent framework adaptable to any Multimodal Large Language Model (MLLM). It emulates human scientific reasoning through a structured workflow that extracts formalized physical knowledge via iterative hypothesis generation, validation, and refinement. By repeatedly formulating and refining candidate equations on high-dimensional data, it identifies the most concise representations that best capture the underlying physical evolution. This automated exploration mirrors the iterative workflow of human scientists, enabling AI to reveal interpretable governing equations directly from raw observations. Across diverse simulated and real-world physics videos, Pixel2Phys discovers accurate, interpretable governing equations and maintains stable long-term extrapolation where baselines rapidly diverge.