Human actions in egocentric videos are often hand-object interactions composed of a verb (performed by the hand) applied to an object. Despite extensive scaling, egocentric datasets still face two limitations: sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open-vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with both seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open-vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.
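As a rough illustration of the decoupling described above, and not the implementation used in this paper, the sketch below scores every (verb, object) pair as the product of a closed-set verb probability and an open-vocabulary object probability. The object probability is taken from the cosine similarity between a visual embedding and one text embedding per object name. All function names, the temperature value, the product rule for combining scores, and the toy vectors are assumptions made for illustration; in practice the embeddings would come from a CLIP-style model rather than being written by hand.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def object_probs(visual_emb, text_embs, temperature=0.07):
    """Open-vocabulary object distribution: similarity of the visual
    embedding to each object-name text embedding, then softmax.
    The temperature is an assumed value, not taken from the paper."""
    return softmax([cosine(visual_emb, t) / temperature for t in text_embs])

def action_scores(verb_logits, visual_emb, text_embs):
    """Score each (verb, object) pair as p(verb) * p(object).
    Treating the two predictions as independent is an assumption."""
    p_verb = softmax(verb_logits)
    p_obj = object_probs(visual_emb, text_embs)
    return [[pv * po for po in p_obj] for pv in p_verb]

# Toy example: "whisk" stands in for an object unseen during training.
verbs = ["take", "cut"]
objects = ["knife", "onion", "whisk"]
verb_logits = [0.2, 2.0]                 # hypothetical verb-encoder output
text_embs = [[1.0, 0.0, 0.0],            # hypothetical text embeddings
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]
visual_emb = [0.1, 0.2, 0.9]             # hypothetical visual embedding

scores = action_scores(verb_logits, visual_emb, text_embs)
best_score, best_verb, best_object = max(
    (scores[i][j], verbs[i], objects[j])
    for i in range(len(verbs))
    for j in range(len(objects))
)
best = (best_verb, best_object)
print(best)
```

Because the object vocabulary enters only through the list of text embeddings, new object names can be added at test time without changing the verb predictor, which is the property the decoupled design is meant to provide.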