Advances toward more faithful and traceable answers from Large Language Models (LLMs) are crucial for a wide range of research and practical applications. One way to reach this goal is to ground answers in reliable sources. However, such Evidence-Based QA has proven to work insufficiently with LLMs, both in citing the correct sources (source quality) and in truthfully representing the information within those sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified, high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data improves performance in both in-distribution and out-of-distribution settings. Furthermore, we show that data quality, which the proposed quality filters can drastically improve, matters more than quantity in improving Evidence-Based QA.