Image captioning is the challenging task of generating a textual description of an image, and it draws on both computer vision and natural language processing. This paper proposes a deep neural framework for image caption generation built around a GRU decoder with an attention mechanism. Multiple pre-trained convolutional neural networks serve as the encoder that extracts image features, and a GRU-based language model serves as the decoder that generates the descriptive sentence. To improve performance, we integrate Bahdanau attention into the GRU decoder so that the model learns to focus on specific image regions. We evaluate the approach on the MSCOCO and Flickr30k datasets and show that it achieves scores competitive with state-of-the-art methods. The proposed framework helps bridge computer vision and natural language processing and can be extended to specific domains.
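To make the decoder description concrete, the following is a minimal NumPy sketch of one decoding step that combines Bahdanau (additive) attention with a GRU cell. It is an illustration under stated assumptions, not the implementation used in this paper: the dimensions, the parameter names (`W_f`, `W_s`, `v`, `W_z`, `U_z`, and so on), the random weights, and the single 7x7 feature grid standing in for the encoder output are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bahdanau_attention(features, hidden, W_f, W_s, v):
    # Additive attention: score_i = v^T tanh(W_f f_i + W_s h).
    # features: (regions, feat_dim), hidden: (hid_dim,)
    scores = np.tanh(features @ W_f + hidden @ W_s) @ v  # (regions,)
    weights = softmax(scores)                            # sums to 1 over regions
    context = weights @ features                         # (feat_dim,)
    return context, weights

def gru_step(x, h, p):
    # One GRU update (biases omitted for brevity).
    # Convention assumed here: h_new = (1 - z) * h + z * h_tilde.
    z = sigmoid(x @ p["W_z"] + h @ p["U_z"])              # update gate
    r = sigmoid(x @ p["W_r"] + h @ p["U_r"])              # reset gate
    h_tilde = np.tanh(x @ p["W_h"] + (r * h) @ p["U_h"])  # candidate state
    return (1.0 - z) * h + z * h_tilde

# Hypothetical sizes: 49 regions (7x7 grid), 64-d features,
# 32-d word embedding, 48-d hidden state, 16-d attention space.
regions, feat_dim, emb_dim, hid_dim, att_dim = 49, 64, 32, 48, 16
rng = np.random.default_rng(0)

features = rng.normal(size=(regions, feat_dim))  # stand-in for CNN encoder output
word_emb = rng.normal(size=emb_dim)              # embedding of the previous word
h = np.zeros(hid_dim)                            # initial decoder state

W_f = rng.normal(scale=0.1, size=(feat_dim, att_dim))
W_s = rng.normal(scale=0.1, size=(hid_dim, att_dim))
v = rng.normal(scale=0.1, size=att_dim)

in_dim = emb_dim + feat_dim
params = {k: rng.normal(scale=0.1, size=(in_dim, hid_dim))
          for k in ("W_z", "W_r", "W_h")}
params.update({k: rng.normal(scale=0.1, size=(hid_dim, hid_dim))
               for k in ("U_z", "U_r", "U_h")})

# Attend over image regions, then feed [word embedding; context] to the GRU.
context, weights = bahdanau_attention(features, h, W_f, W_s, v)
h_next = gru_step(np.concatenate([word_emb, context]), h, params)
```

In a full decoder, `h_next` would be projected onto the vocabulary to predict the next word, and the step would repeat with the new hidden state so that the attention weights can shift across regions as the caption is generated.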