In this work, we introduce (a) the new problem of anticipating object state changes in images and videos during procedural activities, (b) new curated annotation data for object state change classification based on the Ego4D dataset, and (c) the first method for addressing this challenging problem. Solutions to this new task have important implications for vision-based scene understanding, automated monitoring systems, and action planning. The proposed framework predicts object state changes that will occur in the near future due to yet unseen human actions by integrating learned visual features, which represent recent visual information, with natural language processing (NLP) features, which represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset, which provides a large-scale collection of first-person-perspective videos across numerous interaction scenarios, we introduce Ego4D-OSCA, an extension that provides new curated annotation data for the Object State Change Anticipation (OSCA) task. An extensive experimental evaluation demonstrates the proposed method's efficacy in predicting object state changes in dynamic scenarios. The performance of the proposed approach also underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems, and it lays the groundwork for future research on the new task of object state change anticipation. The source code and the new annotation data (Ego4D-OSCA) will be made publicly available.
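To make the fusion idea in the abstract concrete, below is a minimal, hypothetical sketch of late fusion between recent visual features and language features encoding past state changes and actions; it is not the authors' implementation, and all module names, feature dimensions, and the concatenation-based design are illustrative assumptions.

```python
# Illustrative sketch only: fuse pooled video-backbone features with a text
# embedding of past state-change/action descriptions to score candidate
# upcoming state changes. Dimensions and architecture are assumptions.
import torch
import torch.nn as nn

class OSCAFusionHead(nn.Module):
    def __init__(self, vis_dim=2048, lang_dim=768, hidden=512, num_states=10):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)    # project visual features
        self.lang_proj = nn.Linear(lang_dim, hidden)  # project language features
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_states),  # logits over candidate state changes
        )

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, vis_dim) pooled features of recently observed frames
        # lang_feat: (B, lang_dim) embedding of past state changes and actions
        fused = torch.cat([self.vis_proj(vis_feat), self.lang_proj(lang_feat)], dim=-1)
        return self.classifier(fused)

# Usage with random stand-in features
model = OSCAFusionHead()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```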