Image captioning is a vision-language task that continues to attract research interest worldwide in the 2020s. The MS-COCO Caption benchmark, although published in 2015, is still commonly used to evaluate advanced captioning models. Recent captioning models trained on the MS-COCO Caption dataset perform well only on English language patterns; they perform less well on scenes captured in Vietnam and cannot fluently caption images in Vietnamese. To support low-resource research communities such as Vietnam's, we introduce a novel Vietnamese image captioning dataset, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The dataset comprises complex scenes captured in Vietnam, manually annotated by Vietnamese annotators under strict rules and supervision. In this paper, we describe the dataset creation process in detail. Our preliminary analysis shows that the dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines that perform well on the MS-COCO dataset. These modest results indicate that UIT-OpenViIC leaves room for improvement and can serve as a standard Vietnamese benchmark on which the research community evaluates captioning models. Furthermore, we present CAMO, an approach that effectively enhances image representation through a multi-level encoder output fusion mechanism and thereby improves the quality of generated captions compared with previous captioning models.