Anticipating object state changes in images and videos is a challenging problem whose solution has important implications for vision-based scene understanding, automated monitoring systems, and action planning. In this work, we propose the first method for solving this problem. The proposed method predicts object state changes that will occur in the near future as a result of yet unseen human actions. To address this new problem, we propose a novel framework that integrates learnt visual features, which represent recent visual information, with natural language processing (NLP) features, which represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset, which provides a large-scale collection of first-person perspective videos across numerous interaction scenarios, we introduce new curated annotation data for the object state change anticipation (OSCA) task, denoted Ego4D-OSCA. An extensive experimental evaluation demonstrates the efficacy of the proposed method in predicting object state changes in dynamic scenarios. This work underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems. Moreover, it lays the groundwork for future research on the new task of object state change anticipation. The source code and the new annotation data (Ego4D-OSCA) will be made publicly available.