Humans can effortlessly anticipate how objects might move or change through interaction: imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts the future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world and dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. Project page: objectforesight.github.io
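As a concrete illustration of the object-centric, 6-DoF prediction setting described above, the sketch below shows one plausible input/output convention: a history of SE(3) poses for a rigid object goes in, a short horizon of future SE(3) poses comes out. This is a minimal sketch, not the paper's method or API; the names (ObjectTrack, constant_velocity_baseline, HORIZON) are hypothetical, and the constant-velocity extrapolation only stands in for the learned, video-conditioned model.

```python
# Minimal sketch (assumed interface, not the authors' code): represents an object's
# past 6-DoF poses as SE(3) matrices and extrapolates a short future trajectory.
import numpy as np
from dataclasses import dataclass

HORIZON = 8  # hypothetical number of future steps to predict


@dataclass
class ObjectTrack:
    """Past 6-DoF poses of one rigid object: array of shape (T, 4, 4) SE(3) matrices."""
    poses: np.ndarray


def constant_velocity_baseline(track: ObjectTrack, horizon: int = HORIZON) -> np.ndarray:
    """Extrapolate future poses by repeatedly applying the last relative motion.

    A learned model like the one described in the abstract would replace this rule
    with a network conditioned on the egocentric video; this baseline only shows
    the convention (SE(3) poses in, SE(3) poses out).
    """
    T_prev, T_last = track.poses[-2], track.poses[-1]
    delta = T_last @ np.linalg.inv(T_prev)  # last observed frame-to-frame motion
    future, current = [], T_last
    for _ in range(horizon):
        current = delta @ current
        future.append(current)
    return np.stack(future)  # shape (horizon, 4, 4)


if __name__ == "__main__":
    # Two observed poses: identity, then a 5 cm translation along x.
    T0 = np.eye(4)
    T1 = np.eye(4)
    T1[0, 3] = 0.05
    preds = constant_velocity_baseline(ObjectTrack(poses=np.stack([T0, T1])))
    print(preds[:, 0, 3])  # x-translation grows by 0.05 m per predicted step
```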