In this work, following the intuition that adverbs describing scene sequences are best identified by reasoning over high-level concepts of object behaviour, we propose a new framework that reasons over object behaviours extracted from raw video clips to recognize each clip's adverb types. Importantly, while previous works on general-scene adverb recognition assume knowledge of the clip's underlying action types, our method is directly applicable in the more general problem setting where the action type of a video clip is unknown. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour facts from raw video clips, together with novel symbolic and transformer-based reasoning methods that operate over these extracted facts to identify adverb types. Experimental results demonstrate that our proposed methods perform favourably against the previous state of the art. Additionally, to support efforts in symbolic video processing, we release two new datasets of object-behaviour facts extracted from raw video clips: the MSR-VTT-ASP and ActivityNet-ASP datasets.