In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To this end, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors, which reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion: it aggregates the information of different encoder layers for cross-layer semantic complementarity. With these two designs, the proposed LSTNet can model the local visual information of grid features to improve captioning quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The experimental results show that LSTNet is not only capable of local visual modeling, but also outperforms a range of state-of-the-art captioning models on the offline and online tests, achieving 134.8 CIDEr and 136.3 CIDEr, respectively. In addition, the generalization of LSTNet is verified on the Flickr8k and Flickr30k datasets.
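To make the two ideas concrete, the following is a minimal illustrative sketch and not the LSA or LSF design itself, whose details are not given in this abstract. It assumes that "neighbors" means grid cells within a fixed Chebyshev radius, that attention uses the raw features as queries, keys, and values with no learned projections, and that cross-layer fusion is a plain average. The names `neighbor_mask`, `local_attention`, and `fuse_layers` are hypothetical.

```python
import math


def neighbor_mask(height, width, radius=1):
    """mask[i][j] is True when cell j lies within `radius` of cell i.

    Cells are indexed row-major; distance is Chebyshev (assumed here).
    """
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        ri, ci = divmod(i, width)
        for j in range(n):
            rj, cj = divmod(j, width)
            mask[i][j] = max(abs(ri - rj), abs(ci - cj)) <= radius
    return mask


def local_attention(features, height, width, radius=1):
    """Scaled dot-product self-attention restricted to each cell's neighbors.

    `features` is a list of height*width vectors. No learned projections
    are used; this only illustrates the locality restriction.
    """
    dim = len(features[0])
    mask = neighbor_mask(height, width, radius)
    output = []
    for i, query in enumerate(features):
        idx = [j for j in range(len(features)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(query, features[j])) / math.sqrt(dim)
            for j in idx
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append([
            sum(w * features[j][d] for w, j in zip(weights, idx)) / total
            for d in range(dim)
        ])
    return output


def fuse_layers(layer_outputs):
    """Average per-cell features across layers (placeholder for learned fusion)."""
    n_layers = len(layer_outputs)
    n_cells = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [sum(layer[i][d] for layer in layer_outputs) / n_layers for d in range(dim)]
        for i in range(n_cells)
    ]


if __name__ == "__main__":
    # A toy 2x2 grid with 2-dimensional features, passed through two "layers".
    grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    layer1 = local_attention(grid, 2, 2)
    layer2 = local_attention(layer1, 2, 2)
    print(fuse_layers([layer1, layer2]))
```

In this sketch the mask is what makes the attention local: a corner cell of a 3x3 grid attends to four cells, whereas unrestricted self-attention would attend to all nine. A learned fusion module would replace the plain average in `fuse_layers`.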