Artificial neural networks typically struggle to generalize to out-of-context examples. One reason for this limitation is that datasets incorporate only partial information about the potential correlational structure of the world. In this work, we propose TIDA (Targeted Image-editing Data Augmentation), a targeted data augmentation method that improves models' human-like abilities (e.g., gender recognition) by filling the correlational structure gap using a text-to-image generative model. More specifically, TIDA identifies specific skills in the captions describing images (e.g., the presence of a specific gender in the image), changes the caption (e.g., "woman" to "man"), and then uses a text-to-image model to edit the image so that it matches the new caption (e.g., changing only the woman to a man while keeping the context identical). On the Flickr30K benchmark, we show that, compared with the original dataset, a TIDA-enhanced dataset targeting gender, color, and counting abilities yields better performance on several image captioning metrics. Furthermore, beyond the classical BLEU metric, we conduct a fine-grained analysis of how our models improve over the baseline. We also compare text-to-image generative models and find that the image captioning models behave differently in terms of visual encoding and textual decoding.
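The caption-editing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the swap table, the function names `swap_skill_word` and `augment`, and the `edit_image` callable (a stand-in for whichever text-to-image editing model is used) are all assumptions introduced here for illustration. Only the "woman" to "man" swap comes from the text.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical swap table for the gender skill. The abstract only gives
# "woman" -> "man"; the reverse direction is an assumption.
GENDER_SWAPS: Dict[str, str] = {"woman": "man", "man": "woman"}


def swap_skill_word(caption: str, swaps: Dict[str, str]) -> Optional[str]:
    """Swap the first skill word found in the caption.

    Returns None when the caption contains no skill word, so the caller
    can skip image-caption pairs that do not exercise the skill.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in swaps) + r")\b",
        re.IGNORECASE,
    )
    match = pattern.search(caption)
    if match is None:
        return None
    word = match.group(0)
    replacement = swaps[word.lower()]
    # Preserve capitalisation of the first letter (e.g. sentence start).
    if word[0].isupper():
        replacement = replacement.capitalize()
    return caption[: match.start()] + replacement + caption[match.end():]


def augment(
    pairs: List[Tuple[str, str]],
    swaps: Dict[str, str],
    edit_image: Callable[[str, str, str], str],
) -> List[Tuple[str, str]]:
    """Build augmented (image, caption) pairs.

    `edit_image(image, old_caption, new_caption)` is a placeholder for the
    text-to-image editing model; here images are represented by plain
    identifiers such as file paths.
    """
    augmented = []
    for image, caption in pairs:
        new_caption = swap_skill_word(caption, swaps)
        if new_caption is None:
            continue  # caption does not mention the targeted skill
        augmented.append((edit_image(image, caption, new_caption), new_caption))
    return augmented
```

In this sketch the augmented pairs would be added to the original training set; color and counting skills could be handled with analogous swap tables, which the abstract does not specify.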