Synthetic data generation has recently gained significant attention for its utility in training large vision and language models. However, the application of synthetic data to training multimodal context-augmented generation systems remains relatively unexplored. This gap matters because existing vision and language models (VLMs) are not trained specifically for context-augmented generation. Resources for adapting such models are therefore crucial for enabling their use in retrieval-augmented generation (RAG) settings, where a retriever gathers relevant information that is then provided to a generative model via context augmentation. To address this problem, we generate SK-VQA: a large synthetic multimodal dataset containing over 2 million question-answer pairs that require external knowledge to determine the final answer. Our dataset is both larger and significantly more diverse than existing resources of its kind, possessing over 11x more unique questions and containing images from a greater variety of sources than previously proposed datasets. Through extensive experiments, we demonstrate that our synthetic dataset not only serves as a challenging benchmark, but is also highly effective for adapting existing generative multimodal models for context-augmented generation.