Human actions in egocentric videos are often hand-object interactions, each composed of a verb (performed by the hand) applied to an object. Despite extensive scaling, egocentric datasets still face two limitations: sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open-vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open-vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.
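The decoupling of verb and object predictions can be illustrated with a minimal sketch. It is not the paper's implementation: the function `predict_action`, the temperature value, and the rule of multiplying verb and object probabilities to score an action are all assumptions made for illustration, and plain NumPy arrays stand in for the outputs of the verb encoder and the CLIP-style image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_action(verb_logits, visual_emb, text_embs, verbs, objects,
                   temperature=0.07):
    """Hypothetical helper: score (verb, object) pairs from decoupled predictions.

    verb_logits: (V,) scores over a closed verb set (stand-in for the verb encoder).
    visual_emb:  (D,) embedding of the interacting object (stand-in for the
                 object encoder output).
    text_embs:   (O, D) text embeddings of object-name prompts; new rows can be
                 added at test time, which is what keeps the object set open.
    temperature: assumed value, not taken from the paper.
    """
    verb_probs = softmax(verb_logits)
    # Cosine similarity between the visual embedding and each object prompt.
    sims = l2_normalize(text_embs) @ l2_normalize(visual_emb)
    object_probs = softmax(sims / temperature)
    # Assumption: verb and object are treated as independent when composing.
    joint = np.outer(verb_probs, object_probs)
    v, o = np.unravel_index(joint.argmax(), joint.shape)
    return verbs[v], objects[o], joint

# Toy inputs: one-hot "text embeddings" and a visual embedding closest to "onion".
verbs = ["take", "cut"]
objects = ["knife", "onion", "whisk"]
verb, obj, joint = predict_action(
    verb_logits=np.array([0.2, 2.0]),
    visual_emb=np.array([0.1, 0.9, 0.2]),
    text_embs=np.eye(3),
    verbs=verbs,
    objects=objects,
)
print(verb, obj)
```

Because the object vocabulary enters only through the rows of `text_embs`, a novel object can be scored by appending its prompt embedding, with no change to the verb branch.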