Temporally localizing object states in videos is crucial for understanding human activities beyond actions and objects. Progress on this task has been limited by a lack of training data, owing to the inherent ambiguity and variety of object states. Learning from the transcribed narrations of instructional videos is an appealing way to avoid exhaustive annotation. However, narrations describe object states far less often than they describe actions, which makes them a weak source of direct supervision for object states. In this work, we propose to extract object state information from the action information contained in narrations, using large language models (LLMs). Our observation is that LLMs encode world knowledge about the relationship between actions and their resulting object states, and can infer the presence of object states from past action sequences. The proposed LLM-based framework is flexible enough to generate plausible object state pseudo-labels for arbitrary categories. We evaluate our method on our newly collected Multiple Object States Transition (MOST) dataset, which includes dense temporal annotation of 60 object state categories. Our model, trained on the generated pseudo-labels, achieves an improvement of over 29% in mAP over strong zero-shot vision-language models, showing the effectiveness of explicitly extracting object state information from actions through LLMs.
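The idea of inferring object states from past action sequences can be illustrated with a minimal sketch. The abstract does not specify the prompts, label format, or temporal alignment, so everything below is an assumption: `Action`, `TOY_RULES`, `toy_inferer`, and `pseudo_labels` are hypothetical names, and a small keyword table stands in for the LLM query so that the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    """One narrated action with its time span in seconds (hypothetical format)."""
    start: float
    end: float
    text: str


# Hypothetical stand-in for LLM world knowledge: for each object state,
# keywords of actions that bring the state about ("on") or end it ("off").
TOY_RULES: Dict[str, Dict[str, List[str]]] = {
    "apple: sliced": {"on": ["slice"], "off": []},
    "egg: whisked": {"on": ["whisk"], "off": ["fry"]},
}


def toy_inferer(past_actions: List[str], state: str) -> int:
    """Return 1 if `state` is judged present after `past_actions`, else 0.

    A real system would query an LLM here; this keyword lookup is only
    an assumption made to keep the sketch runnable.
    """
    rule = TOY_RULES[state]
    label = 0
    for action in past_actions:  # replay the history in temporal order
        if any(keyword in action for keyword in rule["on"]):
            label = 1
        elif any(keyword in action for keyword in rule["off"]):
            label = 0
    return label


def pseudo_labels(
    actions: List[Action],
    states: List[str],
    infer: Callable[[List[str], str], int],
) -> List[Dict[str, int]]:
    """Assign one pseudo-label per state to each action segment,
    based on all actions up to and including that segment."""
    history: List[str] = []
    labels: List[Dict[str, int]] = []
    for action in actions:
        history.append(action.text)
        labels.append({state: infer(history, state) for state in states})
    return labels


if __name__ == "__main__":
    video = [
        Action(0.0, 5.0, "peel the apple"),
        Action(5.0, 12.0, "slice the apple"),
        Action(12.0, 20.0, "put the apple in a bowl"),
    ]
    for action, label in zip(video, pseudo_labels(video, ["apple: sliced"], toy_inferer)):
        print(action.start, action.end, label)
```

Passing the inference function as an argument keeps the labeling loop independent of how the state is inferred, so the keyword table could in principle be swapped for an actual LLM call without changing `pseudo_labels`.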