Image captioning and object detection are central to visual understanding of the surroundings in real-world applications such as autonomous navigation and mobility. This work introduces a novel multitask learning framework that combines the two tasks in a single joint model. We propose TICOD, a Transformer-based Image Captioning and Object Detection model that trains both tasks jointly by combining the losses from the image captioning and object detection networks. Joint training lets the model exploit the complementary information shared between the two tasks, which improves image captioning performance. The transformer-based architecture integrates both networks end to end and performs captioning and detection jointly. We evaluate the approach through comprehensive experiments on the MS-COCO dataset. Our model outperforms baselines from the image captioning literature, achieving a 3.65% improvement in BERTScore.
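To make the idea of combining the two task losses concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state how the losses are weighted, so the function name `joint_loss` and the weights `w_caption` and `w_detection` are hypothetical and chosen only for illustration.

```python
def joint_loss(caption_loss, detection_loss, w_caption=1.0, w_detection=1.0):
    """Combine per-task losses into a single training objective.

    Assumption: a weighted sum. The abstract only says the losses are
    combined; the weights here are hypothetical hyperparameters.
    """
    return w_caption * caption_loss + w_detection * detection_loss


# Example with plain floats standing in for per-batch loss values.
total = joint_loss(caption_loss=2.0, detection_loss=3.0)
down_weighted = joint_loss(2.0, 4.0, w_detection=0.5)
```

In a deep learning framework, the same expression would typically be applied to scalar loss tensors, so that one backward pass propagates gradients from both tasks into the shared parts of the network.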