Situation recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. A situation recognition model therefore needs to understand both the context of the image and the visual-linguistic meaning of semantic roles. To this end, we leverage the CLIP foundation model, which has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks operating on CLIP image and text embedding features obtain noteworthy results on the situation recognition task, even outperforming the state-of-the-art CoFormer, a Transformer-based model, thanks to the external implicit visual-linguistic knowledge encapsulated by CLIP and the expressive power of modern MLP block designs. Motivated by this, we design a cross-attention-based Transformer that uses CLIP visual tokens to model the relation between textual roles and visual entities. This model, ClipSitu XTF, outperforms the existing state of the art by a large margin of 14.1% in top-1 accuracy on semantic role labelling (value) on the imSitu dataset. Similarly, ClipSitu XTF obtains state-of-the-art situation localization performance. We will make the code publicly available.
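To make the cross-attention idea concrete, the following is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention in which role embeddings act as queries over image patch tokens. It is an illustration under stated assumptions, not the ClipSitu XTF implementation: the function names, the toy dimensions, and the random vectors standing in for CLIP text and visual features are all hypothetical, and the learned query/key/value projections, multiple heads, and stacked layers of a real Transformer are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(role_queries, visual_tokens):
    """Single-head scaled dot-product cross-attention.

    role_queries:  list of R vectors (stand-ins for role text embeddings)
    visual_tokens: list of P vectors (stand-ins for image patch tokens)
    Returns (outputs, weights): R attended vectors and an R x P weight matrix.
    Learned projections are omitted to keep the sketch short (an assumption).
    """
    dim = len(role_queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in role_queries:
        # Similarity of this role query to every visual token.
        scores = [scale * sum(q * k for q, k in zip(query, token))
                  for token in visual_tokens]
        attn = softmax(scores)
        # Weighted average of the visual tokens for this role.
        attended = [sum(a * token[d] for a, token in zip(attn, visual_tokens))
                    for d in range(dim)]
        outputs.append(attended)
        weights.append(attn)
    return outputs, weights


rng = random.Random(0)
DIM, NUM_ROLES, NUM_PATCHES = 8, 3, 5  # toy sizes, not taken from the paper
roles = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_ROLES)]
patches = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
attended, attn_weights = cross_attention(roles, patches)
print(len(attended), len(attended[0]), len(attn_weights[0]))
```

Each row of `attn_weights` indicates how strongly one role attends to each patch, and each row of `attended` is the corresponding weighted average of patch tokens. In a full model one would presumably pass such per-role vectors to a classifier over candidate nouns; that step is not shown here.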