Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large-scale synthetic dataset (1.4M examples), generated using TrueTeacher, and a checkpoint trained on this data.
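The data generation idea described in the abstract (summarize documents with several models, then have an LLM label each summary for factual consistency) can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for illustration only: `build_synthetic_dataset`, `LabeledExample`, the two toy summarizers, and `toy_teacher`, which uses a simple substring check as a stand-in for prompting an LLM. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class LabeledExample:
    document: str
    summary: str
    label: int  # 1 = factually consistent, 0 = inconsistent


# Hypothetical interfaces: a summarizer maps a document to a summary,
# and a teacher maps (document, summary) to a binary consistency label.
Summarizer = Callable[[str], str]
Teacher = Callable[[str, str], int]


def build_synthetic_dataset(
    documents: Iterable[str],
    summarizers: Iterable[Summarizer],
    teacher: Teacher,
) -> List[LabeledExample]:
    """Summarize every document with every summarizer, then label with the teacher."""
    summarizers = list(summarizers)  # allow reuse across documents
    examples = []
    for document in documents:
        for summarize in summarizers:
            summary = summarize(document)
            label = teacher(document, summary)
            examples.append(LabeledExample(document, summary, label))
    return examples


# Toy stand-ins (assumptions, not real models).
def first_sentence_summarizer(document: str) -> str:
    # Extractive: returns the first sentence, so it is always supported by the text.
    return document.split(".")[0].strip() + "."


def negating_summarizer(document: str) -> str:
    # Deliberately introduces a factual error by negating the first sentence.
    return first_sentence_summarizer(document).replace(" is ", " is not ")


def toy_teacher(document: str, summary: str) -> int:
    # Placeholder for an LLM judgment: consistent only if the summary text
    # appears verbatim in the document.
    return int(summary.rstrip(".") in document)


docs = ["The sky is blue. It rained yesterday."]
dataset = build_synthetic_dataset(
    docs, [first_sentence_summarizer, negating_summarizer], toy_teacher
)
for example in dataset:
    print(example.label, example.summary)
```

In a realistic setting, the summarizers would be summarization models of varying capacity and the teacher would be an LLM prompted for a consistency judgment. The resulting (document, summary, label) triples would then serve as training data for a smaller student classifier.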