In this work, we introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding. CATE has potential applications in areas such as task planning and learning from demonstration. We propose several CATE-based task formulations, such as action selection and action specification, in which video understanding models connect actions and effects at the semantic and fine-grained levels. We observe that different formulations produce representations that capture intuitive action properties. We also design several baseline models for action selection and action specification. Despite the intuitive nature of these tasks, we observe that models struggle, and humans outperform them by a large margin. This study aims to establish a foundation for future efforts by showcasing the flexibility and versatility of connecting actions and effects in video understanding, in the hope of inspiring more advanced formulations and models.