In this paper, we introduce Masked Feature Modelling (MFM), a novel approach for the unsupervised pre-training of a Graph Attention Network (GAT) block. MFM uses a pre-trained Visual Tokenizer to reconstruct masked features of objects within a video, with pre-training carried out on the MiniKinetics dataset. We then incorporate the pre-trained GAT block into a state-of-the-art bottom-up supervised video-event recognition architecture, ViGAT, to give the model a better starting point and improve its overall accuracy. Experiments on the YLI-MED dataset show that MFM improves event recognition performance.
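To make the masked-reconstruction idea concrete, the following is a minimal sketch of the objective only, not the authors' implementation. It assumes per-object feature vectors, a zero vector as the mask token, a toy dot-product self-attention layer standing in for the GAT block, and a mean-squared error computed on the masked objects against their original features. The function names, the mask ratio, and the loss are all illustrative assumptions; the role of the Visual Tokenizer is not specified in enough detail here to be modelled, so it is omitted.

```python
import math
import random


def attention_block(features):
    """Toy single-head dot-product self-attention over object features.

    Assumption: a stand-in for the GAT block, with no learned weights.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * key[d] for w, key in zip(weights, features))
            for d in range(dim)
        ])
    return outputs


def mask_features(features, mask_ratio, rng):
    """Replace a random subset of object features with a zero mask token."""
    n = len(features)
    num_masked = max(1, int(round(mask_ratio * n)))
    masked_idx = sorted(rng.sample(range(n), num_masked))
    mask_token = [0.0] * len(features[0])  # assumption: zero vector as mask token
    corrupted = [
        mask_token[:] if i in masked_idx else f[:]
        for i, f in enumerate(features)
    ]
    return corrupted, masked_idx


def masked_reconstruction_loss(pred, target, masked_idx):
    """Mean-squared error restricted to the masked objects."""
    errors = [
        (p - t) ** 2
        for i in masked_idx
        for p, t in zip(pred[i], target[i])
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)
# Hypothetical input: 6 objects in a frame, each with an 8-dimensional feature.
objects = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
corrupted, masked_idx = mask_features(objects, mask_ratio=0.5, rng=rng)
pred = attention_block(corrupted)
loss = masked_reconstruction_loss(pred, objects, masked_idx)
```

In an actual pre-training loop, the block would have learnable parameters and this loss would be minimized over many videos; the sketch only shows how masking, the attention block, and the reconstruction target fit together.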