Image captioning remains a challenging task: a model must both understand the visual content of an image and describe that content in natural language. In this paper, we propose an efficient way to improve the image understanding ability of transformer-based methods by extending the Object Relation Transformer architecture with the Attention on Attention mechanism. Experiments on the VieCap4H dataset show that the proposed method significantly outperforms the original architecture on both the public and private test sets of the Image Captioning shared task held by VLSP.
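To make the Attention on Attention idea concrete, the following is a minimal sketch of the gating step as it is commonly formulated: an attended vector is computed by ordinary scaled dot-product attention, concatenated with the query, and passed through two linear maps that yield an "information" vector and a sigmoid "gate", whose element-wise product is the output. This is an illustration under stated assumptions, not the implementation evaluated in this paper: the single-head, single-query setup, the helper names (`attend`, `aoa`, `init_params`), and the random initialization are all hypothetical, and the geometric relation weighting of the Object Relation Transformer is omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def aoa(q, keys, values, params):
    """Attention on Attention: gate the attended result using the query."""
    v_hat = attend(q, keys, values)
    x = q + v_hat  # list concatenation: [query ; attended vector]
    # Information vector: linear map of the concatenation.
    info = [a + b for a, b in zip(matvec(params["W_i"], x), params["b_i"])]
    # Attention gate: sigmoid of a second linear map of the same input.
    gate = [1.0 / (1.0 + math.exp(-(a + b)))
            for a, b in zip(matvec(params["W_g"], x), params["b_g"])]
    # Element-wise product keeps only the information the gate lets through.
    return [g * i for g, i in zip(gate, info)]


def init_params(d, seed=0):
    """Hypothetical small random initialization for a d-dimensional model."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(2 * d)] for _ in range(d)]

    return {"W_i": mat(), "b_i": [0.0] * d, "W_g": mat(), "b_g": [0.0] * d}
```

For example, with `params = init_params(4)`, a query `q` of length 4, and a list of 4-dimensional region features passed as both `keys` and `values`, `aoa(q, regions, regions, params)` returns a gated vector of length 4. Because the gate lies strictly between 0 and 1, each output component is a scaled-down copy of the corresponding information component, which is how the mechanism can suppress attended results that are irrelevant to the query.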