Making the answers of Large Language Models (LLMs) more faithful and traceable is crucial for a wide range of research and practical applications. One way to reach this goal is to ground answers in reliable sources. However, LLMs have so far performed poorly at this Evidence-Based QA, both in citing the correct sources (source quality) and in truthfully representing the information within those sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diverse, high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data improves both in-distribution and out-of-distribution performance. Furthermore, we show that data quality, which the proposed quality filters can drastically improve, matters more than quantity in improving Evidence-Based QA.