Understanding the relationships between different parts of an image is crucial to many visual recognition tasks. Although Convolutional Neural Networks (CNNs) achieve impressive results in detecting single objects, they lack the ability to capture the relationships between image regions, which are a key cue in human action recognition. To address this problem, this paper proposes a new module, built on the Vision Transformer (ViT), that functions like a convolutional layer. The proposed action recognition model comprises two components: a deep convolutional network that extracts high-level spatial features from the image, and a Vision Transformer that models the relationships between image regions from the feature map produced by the CNN. Evaluated on the Stanford40 and PASCAL VOC 2012 action datasets, the model achieves 95.5% mAP and 91.5% mAP, respectively, results that compare favorably with other state-of-the-art methods.
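The two-stage idea (a CNN feature map whose spatial positions are then related to one another by attention) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the feature map is given as a small nested list of shape C x H x W, treats every spatial position as one token, and uses a single attention head with identity query/key/value projections. Learned projections, multiple heads, positional embeddings, and the classification head of a real ViT are omitted, and all function names are hypothetical.

```python
import math


def feature_map_to_tokens(fmap):
    """Flatten a C x H x W feature map (nested lists) into H*W tokens of length C."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Scaled dot-product attention weights between all pairs of tokens.

    Assumption: queries and keys are the tokens themselves (identity
    projections); a trained model would use learned linear maps.
    """
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in tokens])
        for query in tokens
    ]


def self_attention(tokens):
    """Replace each token by the attention-weighted average of all tokens."""
    weights = attention_weights(tokens)
    dim = len(tokens[0])
    return [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]


# Toy 2-channel, 2x2 feature map standing in for the CNN output.
toy_fmap = [
    [[1.0, 0.0], [0.0, 1.0]],  # channel 0
    [[0.0, 1.0], [1.0, 0.0]],  # channel 1
]

tokens = feature_map_to_tokens(toy_fmap)  # 4 region tokens, 2 features each
weights = attention_weights(tokens)       # 4 x 4 region-to-region weights
mixed = self_attention(tokens)            # each region now mixes in the others
```

In this toy input, the top-left and bottom-right positions carry the same features, so they attend to each other more strongly than to the other two positions, even though they are not adjacent. A convolution with a small kernel cannot relate them in a single layer, which is the kind of long-range region relationship the abstract refers to.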