Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar objects to unfamiliar ones, current approaches are confined to a closed vocabulary. To address this gap, we introduce a novel open-world formulation of the video OSC problem. The goal is to temporally localize the three stages of an OSC -- the object's initial state, its transitioning state, and its end state -- whether or not the object has been observed during training. To this end, we develop VidOSC, a holistic learning approach that: (1) leverages text and vision-language models for supervisory signals, obviating the need to manually label OSC training data, and (2) abstracts fine-grained shared state representations from objects to enhance generalization. Furthermore, we present HowToChange, the first open-world benchmark for video OSC localization, which offers an order-of-magnitude increase in label space and annotation volume over the best existing benchmark. Experimental results demonstrate the efficacy of our approach in both traditional closed-world and open-world scenarios.
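To make the localization target concrete, the following is a minimal sketch of how per-frame state predictions could be collapsed into temporal segments for the three OSC stages. It is an illustration only, not the VidOSC implementation: the function name `localize_states`, the `background` label for frames outside any stage, and the `fps` parameter are all assumptions introduced here.

```python
from itertools import groupby
from typing import List, Tuple

# The three OSC stages named in the abstract.
STATES = ("initial", "transitioning", "end")


def localize_states(
    frame_labels: List[str], fps: float = 1.0
) -> List[Tuple[str, float, float]]:
    """Collapse per-frame labels into (state, start_sec, end_sec) segments.

    Frames with any label outside STATES (e.g. the hypothetical
    "background" label) are skipped.
    """
    segments = []
    index = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label in STATES:
            segments.append((label, index / fps, (index + length) / fps))
        index += length
    return segments


# Hypothetical per-frame predictions for an 8-frame clip sampled at 1 fps.
labels = [
    "background",
    "initial", "initial",
    "transitioning", "transitioning", "transitioning",
    "end",
    "background",
]
segments = localize_states(labels)
print(segments)
```

In the open-world setting, the same three state labels would apply regardless of whether the object appeared during training, which is why the sketch keys only on the state and not on the object identity.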