Advances towards more faithful and traceable answers from Large Language Models (LLMs) are crucial for various research and practical endeavors. One avenue toward this goal is grounding answers in reliable sources. However, this Evidence-Based QA has proven to work insufficiently with LLMs, both in citing the correct sources (source quality) and in truthfully representing the information within those sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified, high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data improves performance both in- and out-of-distribution. Furthermore, we show that data quality, which can be drastically improved by the proposed quality filters, matters more than quantity in improving Evidence-Based QA.