Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60\,fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined, covering parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labeling). These baselines establish reference performance across all IRIS scenarios and expose systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.