In this work, following the intuition that adverbs describing scene sequences are best identified by reasoning over high-level concepts of object behaviour, we propose a new framework that reasons over object behaviours extracted from raw video clips to recognize each clip's adverb types. Importantly, while previous work on general scene adverb recognition assumes knowledge of a clip's underlying action type, our method is directly applicable in the more general setting where the action type of a video clip is unknown. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour facts from raw video clips, together with novel symbolic and transformer-based reasoning methods that operate over these extracted facts to identify adverb types. Experimental results demonstrate that our proposed methods perform favourably against the previous state of the art. Additionally, to support efforts in symbolic video processing, we release two new datasets of object-behaviour facts extracted from raw video clips: the MSR-VTT-ASP and ActivityNet-ASP datasets.