Dynamics prediction, the problem of predicting the future states of scene objects from their current and prior states, is drawing increasing attention as an instance of learning physics. To address this problem, the Region Proposal Convolutional Interaction Network (RPCIN), a vision-based model, was proposed and achieved state-of-the-art performance in long-term prediction. RPCIN takes only raw images and simple object descriptions, such as the bounding box and segmentation mask of each object, as input. Despite its success, however, the model's capability can be compromised under environment misalignment. In this paper, we investigate two challenging forms of environment misalignment, Cross-Domain and Cross-Context, and propose four datasets designed for these challenges: SimB-Border, SimB-Split, BlenB-Border, and BlenB-Split. The datasets cover two domains and two contexts. Using RPCIN as a probe, experiments on combinations of the proposed datasets reveal potential weaknesses of the vision-based long-term dynamics prediction model. Furthermore, we propose a promising direction for mitigating the Cross-Domain challenge and provide concrete evidence supporting it: on the proposed datasets, it dramatically alleviates the challenge.