Advances towards more faithful and traceable answers from Large Language Models (LLMs) are crucial for a wide range of research and practical endeavors. One avenue toward this goal is grounding answers in reliable sources. However, LLMs have proven insufficient at this Evidence-Based QA, both in citing the correct sources (source quality) and in truthfully representing the information within those sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified, high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data improves performance on both in- and out-of-distribution data. Furthermore, we show that data quality, which the proposed quality filters can drastically improve, matters more than quantity for improving Evidence-Based QA.