Learning from Reasoning Failures via Synthetic Data Generation

Training models on synthetic data has emerged as an increasingly important strategy for improving the performance of generative AI. This approach is particularly helpful for large multimodal models (LMMs) because high-quality paired image-text data is scarce relative to language-only data. While a variety of methods have been proposed for generating large multimodal datasets, they do not tailor the synthetic data to the specific reasoning deficiencies of the LMMs that will be trained on it. In contrast, humans often learn more efficiently by seeking out examples related to the types of reasoning where they have previously failed. Inspired by this observation, we propose a new approach to synthetic data generation that is grounded in the analysis of an existing LMM's reasoning failures. Our methodology leverages frontier models to automatically analyze errors produced by a weaker LMM and to propose new examples that can correct each reasoning failure through additional training; the proposed examples are then filtered to ensure high quality. Using our approach, we generate a large multimodal instruction tuning dataset containing over 553k examples and conduct extensive experiments demonstrating its utility for improving the performance of LMMs on multiple downstream tasks. Our results show that models trained on our synthetic data can even exceed the performance of LMMs trained on an equivalent amount of additional real data, demonstrating the high value of generating synthetic data targeted to specific reasoning failure modes in LMMs. We will make our dataset and code publicly available.
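
The loop described above (find failures of a weaker LMM, have a frontier model diagnose them and propose corrective examples, then filter) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Example` type, the function `generate_from_failures`, all four callables, and the exact-match failure check are hypothetical stand-ins, and the toy functions replace real model calls so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    """Hypothetical record for one image-question-answer triple."""
    image_id: str
    question: str
    answer: str


def generate_from_failures(
    seed_examples: Iterable[Example],
    weak_answer: Callable[[Example], str],
    analyze_failure: Callable[[Example, str], str],
    propose_examples: Callable[[Example, str], List[Example]],
    passes_filter: Callable[[Example], bool],
) -> List[Example]:
    """Collect filtered synthetic examples that target the weak model's errors."""
    synthetic: List[Example] = []
    for ex in seed_examples:
        prediction = weak_answer(ex)
        # Assumption: a failure is a case-insensitive exact-match miss.
        if prediction.strip().lower() == ex.answer.strip().lower():
            continue  # the weak model answered correctly; nothing to target
        # Assumed to be a frontier-model call that describes the error.
        diagnosis = analyze_failure(ex, prediction)
        # Assumed to be a frontier-model call that proposes new examples.
        for candidate in propose_examples(ex, diagnosis):
            if passes_filter(candidate):
                synthetic.append(candidate)
    return synthetic


# Toy stand-ins for the models, used only to make the sketch runnable.
seeds = [
    Example("img_001", "How many cups are on the table?", "3"),
    Example("img_002", "What color is the car?", "red"),
]


def toy_weak_answer(ex: Example) -> str:
    # Pretend the weak model miscounts but gets other questions right.
    return "2" if "How many" in ex.question else ex.answer


def toy_analyze(ex: Example, prediction: str) -> str:
    return "counting error"


def toy_propose(ex: Example, diagnosis: str) -> List[Example]:
    return [
        Example(ex.image_id, "Count the cups one at a time. How many are there?", ex.answer),
        Example(ex.image_id, "", ex.answer),  # malformed candidate with an empty question
    ]


def toy_filter(candidate: Example) -> bool:
    # Keep only candidates with a non-empty question and answer.
    return bool(candidate.question.strip()) and bool(candidate.answer.strip())


synthetic = generate_from_failures(
    seeds, toy_weak_answer, toy_analyze, toy_propose, toy_filter
)
```

Passing the models in as callables keeps the control flow separate from any particular model API; in practice each stand-in would be replaced by a call to the weaker LMM or a frontier model, and the filter by whatever quality checks are applied.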
