Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. Using the mFACE dataset, we also show that our method generalizes to multilingual scenarios. Finally, we release a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher.
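The data-generation idea described in the abstract (summarize documents with several models, then have an LLM teacher label each summary for factual consistency) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `LabeledExample` type, and the toy summarizers and teacher below are all hypothetical stand-ins, and a real pipeline would presumably replace them with actual summarization models and an LLM prompted to judge consistency.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical type aliases: a summarizer maps a document to a summary,
# and a teacher maps (document, summary) to a binary consistency label.
Summarizer = Callable[[str], str]
Teacher = Callable[[str, str], int]


@dataclass
class LabeledExample:
    document: str
    summary: str
    label: int  # 1 = factually consistent, 0 = inconsistent


def build_synthetic_dataset(
    documents: Iterable[str],
    summarizers: List[Summarizer],
    teacher_label: Teacher,
) -> List[LabeledExample]:
    """Label every model-generated summary with the teacher's judgment."""
    examples = []
    for doc in documents:
        for summarize in summarizers:
            summary = summarize(doc)
            examples.append(LabeledExample(doc, summary, teacher_label(doc, summary)))
    return examples


# Toy stand-ins (assumptions for illustration only, not real models).
def first_sentence_summarizer(doc: str) -> str:
    return doc.split(".")[0].strip() + "."


def corrupting_summarizer(doc: str) -> str:
    # Simulates a model that introduces a factual error.
    return first_sentence_summarizer(doc).replace("2023", "1999")


def toy_teacher(doc: str, summary: str) -> int:
    # Crude proxy for an LLM judgment: the summary must appear verbatim in the document.
    return int(summary.rstrip(".") in doc)


docs = ["The bridge opened in 2023. It spans the river."]
dataset = build_synthetic_dataset(
    docs, [first_sentence_summarizer, corrupting_summarizer], toy_teacher
)
for ex in dataset:
    print(ex.label, ex.summary)
```

The resulting `(document, summary, label)` triples are the kind of examples on which a smaller student classifier could then be trained; that training step is omitted here.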