Video action segmentation and recognition tasks are widely applied in many fields. Most previous studies rely on large-scale, computationally expensive visual models to understand videos comprehensively, and few apply graph models directly to reason about video. Graph models offer fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present Semantic2Graph, a graph-based method that turns video action segmentation and recognition into a node classification problem on graphs. To preserve fine-grained relations in videos, we construct the graph structure at the frame level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges model long-term spatio-temporal relations, while the semantic features are embeddings of the label text derived from textual prompts. A graph neural network (GNN) model learns the multi-modal feature fusion. Experimental results show that Semantic2Graph improves on state-of-the-art results on the GTEA and 50Salads datasets. Multiple ablation experiments further confirm that semantic features improve model performance and that semantic edges enable Semantic2Graph to capture long-term dependencies at low cost.
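The frame-level graph described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_frame_graph`, the edge-type strings, and the rule used for semantic edges (linking non-adjacent frames that share an action label) are assumptions made for illustration only, since the abstract does not specify how semantic edges are selected.

```python
from typing import List, Tuple

# (source frame index, target frame index, edge type)
Edge = Tuple[int, int, str]


def build_frame_graph(labels: List[str]) -> List[Edge]:
    """Build a typed edge list with one node per video frame.

    Assumptions (hypothetical, not taken from the paper):
      * temporal edges link each frame to the next frame;
      * semantic edges link non-adjacent frames with the same label;
      * every frame gets a self-loop.
    Edges are stored once, as (earlier frame, later frame, type).
    """
    n = len(labels)
    edges: List[Edge] = []

    # Self-loop edges: each node keeps its own features during aggregation.
    for i in range(n):
        edges.append((i, i, "self_loop"))

    # Temporal edges: consecutive frames.
    for i in range(n - 1):
        edges.append((i, i + 1, "temporal"))

    # Semantic edges: frames at least two steps apart that share a label.
    # This pairwise loop is quadratic in the number of frames; it is kept
    # simple here because the goal is only to show the edge types.
    for i in range(n):
        for j in range(i + 2, n):
            if labels[i] == labels[j]:
                edges.append((i, j, "semantic"))

    return edges


if __name__ == "__main__":
    # Toy per-frame labels; real inputs would come from a dataset annotation.
    toy_labels = ["take", "take", "pour", "take"]
    for edge in build_frame_graph(toy_labels):
        print(edge)
```

In this toy example, the semantic edges connect frames 0 and 1 to frame 3 directly, which shows how a distant frame can become a one-hop neighbor without stacking many layers. In a full pipeline, the node attributes (visual, structural, and semantic features) and the GNN classifier would be built on top of an edge list like this one.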