Advances toward more faithful and traceable answers from Large Language Models (LLMs) are crucial for various research and practical endeavors. One avenue toward this goal is to base the answers on reliable sources. However, this Evidence-Based QA has proven to work insufficiently with LLMs, both in citing the correct sources (source quality) and in truthfully representing the information within the sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified, high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data improves performance both in- and out-of-distribution. Furthermore, we show that data quality, which can be drastically improved by the proposed quality filters, matters more than quantity in improving Evidence-Based QA.