Truthfulness is paramount for large language models (LLMs) as they are increasingly deployed in real-world applications. However, existing LLMs still struggle to generate truthful content, as evidenced by their modest performance on benchmarks such as TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method that enhances the truthfulness of LLMs. GRATH uses out-of-domain question prompts to generate pairwise truthfulness training data, where each pair contains a question together with a correct and an incorrect answer, and then optimizes the model via direct preference optimization (DPO) so that it learns from the truthfulness difference between the two answers. GRATH iteratively refines the truthfulness data and updates the model, gradually improving model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH on different 7B LLMs and compare against LLMs of similar or even larger size on benchmark datasets. Our results show that GRATH effectively improves LLMs' truthfulness without compromising other core capabilities. Notably, GRATH achieves state-of-the-art performance on TruthfulQA, with an MC1 accuracy of 54.71% and an MC2 accuracy of 69.10%, surpassing even 70B LLMs.
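To make the training signal concrete, the following is a minimal sketch of how a single truthfulness pair could be scored under the standard DPO objective. The abstract does not state the loss formula or any hyperparameters, so the `TruthfulnessPair` container, the `dpo_loss` function name, and the default `beta=0.1` are illustrative assumptions rather than the authors' implementation. The inputs are sequence log-probabilities that a real pipeline would obtain from the model being trained and from a frozen reference copy.

```python
import math
from dataclasses import dataclass


@dataclass
class TruthfulnessPair:
    """Hypothetical container: one question with a correct and an incorrect answer."""
    question: str
    correct_answer: str    # plays the role of the preferred ("chosen") response
    incorrect_answer: str  # plays the role of the dispreferred ("rejected") response


def dpo_loss(policy_logp_correct: float, policy_logp_incorrect: float,
             ref_logp_correct: float, ref_logp_incorrect: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    beta is an assumed value; the abstract does not specify it.
    """
    # How much more the policy favors each answer than the reference model does.
    correct_shift = policy_logp_correct - ref_logp_correct
    incorrect_shift = policy_logp_incorrect - ref_logp_incorrect
    margin = beta * (correct_shift - incorrect_shift)

    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the correct answer and away from the incorrect one, which is the direction each training iteration is meant to push the model.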