Natural Language Explanation (NLE) aims to elucidate decision-making processes by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making of large vision-language models (LVLMs) through the use of language models. While existing methods for creating Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they rely heavily on human annotation, which is time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we show how advanced prompting techniques can produce high-quality VQA-NLE data. Our findings indicate that the proposed method is up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving quality nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of the generated text. Our study paves the way for more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to this problem.