In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment, and manually creating new datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment: we use the premise texts from SNLI as input prompts to a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For the extrinsic evaluation, we assess the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data leads to only a slight drop in quality on SNLI-VE, with an F-score of 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to the original training data on another dataset, SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
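The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the runwayml/stable-diffusion-v1-5 and openai/clip-vit-base-patch32 checkpoints, a logistic-regression classifier over concatenated CLIP image and text features, and hypothetical helper names (premise_to_image, pair_features, train), none of which are specified in the abstract.

# Minimal sketch of the described pipeline (assumptions noted above):
# (1) turn an SNLI premise into a synthetic image with Stable Diffusion,
# (2) embed the image and the hypothesis with CLIP,
# (3) train a simple classifier on the concatenated feature vectors.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def premise_to_image(premise: str):
    # The SNLI premise text is used directly as the generation prompt.
    return sd(premise).images[0]

def pair_features(image, hypothesis: str) -> np.ndarray:
    # Concatenated CLIP image and text embeddings for one (image, hypothesis) pair.
    inputs = proc(text=[hypothesis], images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return torch.cat([img, txt], dim=-1).squeeze(0).cpu().numpy()

# Hypothetical training loop over (premise, hypothesis, label) triples from SNLI;
# the label is one of entailment / neutral / contradiction.
def train(triples):
    X = np.stack([pair_features(premise_to_image(p), h) for p, h, _ in triples])
    y = [label for _, _, label in triples]
    return LogisticRegression(max_iter=1000).fit(X, y)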