Attention-based models are now used extensively in sequence-to-sequence learning systems. In image captioning in particular, they are expected to ground each generated word in the correct image regions. However, at each time step of the decoding process, attention-based models usually use the hidden state of the current input to attend to the image regions. Under this setting, they suffer from a "deviated focus" problem: the attention weights are calculated from the previous words rather than the word to be generated, which impairs both grounding and captioning performance. In this paper, we propose Prophet Attention, which takes a form similar to self-supervision. In the training stage, this module uses future information to calculate "ideal" attention weights over the image regions. These "ideal" weights are then used to regularize the "deviated" attention, so that image regions are grounded with the correct words. Prophet Attention can be easily incorporated into existing image captioning models to improve both their grounding and captioning performance. Experiments on the Flickr30k Entities and MSCOCO datasets show that Prophet Attention consistently outperforms baselines in both automatic metrics and human evaluations. Notably, we set a new state of the art on both benchmark datasets and achieve 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40.
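The training-time idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the attention form or the distance used for regularization, so the scaled dot-product attention, the L1 penalty, and all names (`attention_weights`, `prophet_regularizer`, `prev_state`, `future_word`) are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(query, regions):
    """Attention of one query vector over image-region features.

    Scaled dot-product scoring is an assumption made for this sketch.
    """
    scale = math.sqrt(len(query))
    return softmax([dot(query, r) / scale for r in regions])


def prophet_regularizer(deviated, ideal):
    """L1 distance between two attention distributions.

    The choice of L1 is an assumption; the abstract only says the
    "ideal" weights regularize the "deviated" ones.
    """
    return sum(abs(d - i) for d, i in zip(deviated, ideal))


# Toy data: three image regions with 2-dimensional features (hypothetical).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [2.0, 0.0]   # stands in for the hidden state of the current input
future_word = [0.0, 2.0]  # stands in for the word to be generated (training only)

deviated = attention_weights(prev_state, regions)  # what the decoder computes
ideal = attention_weights(future_word, regions)    # computed from future information
penalty = prophet_regularizer(deviated, ideal)     # would be added to the training loss
```

In this sketch the penalty is zero only when the two distributions coincide, so minimizing it alongside the captioning loss would push the decoder's attention toward the regions selected by the future word. At inference time the future word is unavailable, so only `deviated` would be computed.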