Compositional actions consist of dynamic (verb) and static (object) concepts. Humans can easily recognize unseen compositions by reusing previously learned concepts. For machines, this requires recognizing unseen actions composed of previously observed verbs and objects, i.e., it demands so-called compositional generalization ability. To facilitate this research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. To evaluate the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method to solve the new ZS-CAR task. C2C consists of an independent component learning module and a composition inference module. Finally, we devise an enhanced training strategy to address the component variation between seen and unseen compositions and to handle the subtle balance between learning seen and unseen actions. The experimental results demonstrate that the proposed framework significantly surpasses existing compositional generalization methods and sets a new state of the art. The new Sth-com benchmark and code are available at https://github.com/RongchangLi/ZSCAR_C2C.