Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties

General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way. Project page: https://dingmyu.github.io/physion_v2/

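The comparison between model and human predictions described in the abstract can be illustrated with a minimal sketch. It assumes each stimulus has a binary ground-truth outcome, a model-predicted probability, and the fraction of human participants who predicted a positive outcome. The abstract does not specify which statistic is used, so Pearson correlation is an assumption here, and all names and numbers below are hypothetical and made up purely for illustration.

```python
from math import sqrt


def accuracy(preds, labels, threshold=0.5):
    """Fraction of stimuli where the thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(preds, labels))
    return hits / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one entry per stimulus (NOT results from the paper).
labels = [1, 0, 1, 1, 0, 0]                     # ground-truth outcome
model_probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]    # model's predicted probability
human_rates = [0.8, 0.3, 0.9, 0.6, 0.2, 0.1]    # fraction of humans answering "yes"

model_acc = accuracy(model_probs, labels)
human_acc = accuracy(human_rates, labels)
model_human_corr = pearson(model_probs, human_rates)

print(f"model accuracy: {model_acc:.2f}")
print(f"human accuracy: {human_acc:.2f}")
print(f"model-human correlation: {model_human_corr:.2f}")
```

Under this framing, a model can be compared with humans on two separate axes: how often it is correct, and whether it finds the same stimuli easy or hard as humans do. A low correlation alongside reasonable accuracy would indicate that the model reaches its answers differently from people.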