This paper presents ViTOC (Vision Transformer and Object-aware Captioner), a novel vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions. Unlike conventional approaches, ViTOC employs a dual-path architecture built on a Vision Transformer and an object detector, fusing global visual features with local object information through learnable vectors. The model introduces an object-aware prompting strategy that substantially improves its handling of long-tail data. Experiments on the standard COCO dataset demonstrate that ViTOC outperforms baseline models across all evaluation metrics. Additionally, we propose a reference-free evaluation method based on CLIP to further validate the model's effectiveness. By reusing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
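To make the dual-path fusion concrete, the following is a minimal sketch of how a set of learnable vectors can gather information from both visual paths: learnable query tokens cross-attend over the concatenated ViT patch features and detector region features, producing a fixed-length visual prefix for a caption decoder. The dimensions, module names, and the cross-attention formulation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    """Illustrative fusion of global ViT features and local object features
    via learnable query vectors (hypothetical dimensions, not ViTOC's exact design)."""

    def __init__(self, d_model=768, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable vectors that pool information from both visual paths.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, vit_feats, obj_feats):
        # vit_feats: (B, N_patches, d_model) global features from a ViT encoder
        # obj_feats: (B, N_objects, d_model) projected region features from a detector
        ctx = torch.cat([vit_feats, obj_feats], dim=1)   # joint visual context
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        fused, _ = self.cross_attn(q, ctx, ctx)          # queries attend over both paths
        return fused  # (B, num_queries, d_model): visual prefix for the decoder
```

Under this sketch, only the fusion module and decoder would need training, which is consistent with the abstract's claim that reusing pretrained visual parameters keeps end-to-end training efficient.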
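The reference-free evaluation can be illustrated with a CLIPScore-style image-caption similarity: embed the image and the generated caption with a pretrained CLIP model and score their cosine similarity, with no ground-truth captions required. This is a minimal sketch using a public checkpoint via Hugging Face `transformers`; the checkpoint name and any rescaling of the score are assumptions rather than the paper's exact protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A public CLIP checkpoint; the paper's exact CLIP variant is an assumption here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_caption_score(image_path: str, caption: str) -> float:
    """Cosine similarity between image and caption embeddings; higher means
    the generated caption is better aligned with the image."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking the dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```

Because the score compares the caption directly against the image rather than against reference captions, it rewards accurate but diverse phrasings, which suits the abstract's stated goal of evaluating both accuracy and diversity.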