Interpreting and understanding video is a challenging computer vision task in numerous fields, e.g. autonomous driving and sports analytics. Existing approaches to interpreting the actions taking place within a video clip are based upon Temporal Action Localisation (TAL), which typically identifies short-term actions. The emerging field of Complex Activity Detection (CompAD) extends this analysis to long-term activities, with a deeper understanding obtained by modelling the internal structure of a complex activity taking place within the video. We address the CompAD problem using a hybrid graph neural network which combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity. Our approach is as follows: i) First, we propose a novel feature extraction technique which, for each video snippet, generates spatiotemporal "tubes" for the active elements ("agents") in the (local) scene by detecting individual objects and tracking them, and then extracts 3D features from all the agent tubes as well as the overall scene. ii) Next, we construct a local scene graph where each node (representing either an agent tube or the scene) is connected to all other nodes. Attention is then applied to this graph to obtain an overall representation of the local dynamic scene. iii) Finally, all local scene graph representations are interconnected via a temporal graph, to estimate the complex activity class together with its start and end time. The proposed framework outperforms all previous state-of-the-art methods on all three datasets considered: ActivityNet-1.3, Thumos-14, and ROAD.
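As a rough illustration of steps ii) and iii), the following is a minimal sketch and not the authors' implementation. It assumes that per-node features (one vector per agent tube plus one for the scene) have already been extracted, and it substitutes unparameterised scaled dot-product attention with mean pooling for the learned attention, and a fixed-window mean message-passing step for the temporal graph. The function names, the `window` parameter and the toy feature values are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    """Combine equal-length vectors using the given scalar weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def local_scene_representation(nodes: List[Vector]) -> Vector:
    """Attention over a fully connected local scene graph.

    `nodes` holds one feature vector per agent tube plus one for the scene.
    Assumption: attention has no learned weights, every node also attends
    to itself, and the graph readout is a plain mean over updated nodes.
    """
    scale = math.sqrt(len(nodes[0]))
    updated = []
    for query in nodes:
        weights = softmax([dot(query, key) / scale for key in nodes])
        updated.append(weighted_sum(weights, nodes))
    uniform = [1.0 / len(updated)] * len(updated)
    return weighted_sum(uniform, updated)


def temporal_graph_update(snippets: List[Vector], window: int = 1) -> List[Vector]:
    """One round of mean message passing over a temporal graph.

    Assumption: snippet t is linked to every snippet within `window`
    steps of it (itself included).
    """
    out = []
    for t in range(len(snippets)):
        lo, hi = max(0, t - window), min(len(snippets), t + window + 1)
        neighbours = snippets[lo:hi]
        uniform = [1.0 / len(neighbours)] * len(neighbours)
        out.append(weighted_sum(uniform, neighbours))
    return out


# Toy input: two snippets with made-up 2-D features.
video = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # two agent tubes + scene
    [[0.2, 0.8], [0.4, 0.6]],              # one agent tube + scene
]
snippet_reprs = [local_scene_representation(nodes) for nodes in video]
context = temporal_graph_update(snippet_reprs)
```

Here `context` holds one temporally contextualised vector per snippet. In a full system, a classification and boundary-regression head would consume these vectors to predict the activity class and its start and end time; that part is omitted because the abstract does not specify it.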