Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties

General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way. Project page: https://dingmyu.github.io/physion_v2/
