Understanding the relationships among the parts of an image is crucial in many applications, including object recognition, scene understanding, and image classification. Although Convolutional Neural Networks (CNNs) have demonstrated impressive results in classifying and detecting objects, they lack the ability to extract the relationships among different parts of an image, which is a crucial factor in Human Action Recognition (HAR). To address this problem, this paper proposes a new module, based on the Vision Transformer (ViT), that functions like a convolutional layer. In the proposed model, the ViT complements a CNN in a variety of tasks by helping it extract the relationships among the various parts of an image. Compared to a simple CNN, the proposed model is shown to extract the meaningful parts of an image and suppress the misleading ones. Evaluated on the Stanford40 and PASCAL VOC 2012 action datasets, the proposed model achieves 95.5% and 91.5% mean Average Precision (mAP), respectively, which is promising compared to other state-of-the-art methods.
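To make the idea of an attention module that "functions like a convolutional layer" more concrete, the following is a minimal sketch and not the paper's actual module. It assumes a single attention head, identity projections in place of learned query/key/value weights, and no positional embeddings, normalization, or MLP; the names `attention_weights` and `attention_block` are hypothetical. The point it illustrates is that every spatial position of a feature map can attend to every other position while the output keeps the input's H x W x C shape, so such a block could sit between convolutional layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Return the N x N matrix of scaled dot-product attention weights."""
    scale = math.sqrt(len(tokens[0]))
    weights = []
    for q in tokens:
        # Similarity of this position to every position (itself included).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    return weights


def attention_block(fmap):
    """Self-attention over the spatial positions of an H x W x C feature map.

    Assumption: identity projections stand in for learned Q/K/V weights.
    The output has the same shape as the input.
    """
    h, w = len(fmap), len(fmap[0])
    # Flatten the grid into N = H * W tokens, each a C-dimensional vector.
    tokens = [cell for row in fmap for cell in row]
    weights = attention_weights(tokens)
    mixed = []
    for token, row_w in zip(tokens, weights):
        # Weighted sum of all positions' features.
        attended = [
            sum(wt * v[c] for wt, v in zip(row_w, tokens))
            for c in range(len(token))
        ]
        # Residual connection keeps the original features available.
        mixed.append([t + a for t, a in zip(token, attended)])
    # Restore the H x W grid layout.
    return [mixed[r * w:(r + 1) * w] for r in range(h)]
```

In this sketch each row of the attention matrix sums to one, so a position that is similar to a few others concentrates its weight on them and contributes little from the rest, which is one simple way to picture emphasizing related parts of an image over unrelated ones.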