Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so that the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with a clear difference in correctness, reducing to standard DPO when both responses share the same factuality label. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B–14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both the base models and standard DPO. On Qwen3-8B, F-DPO cuts the hallucination rate fivefold (from 0.424 to 0.084) while improving the factuality score by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B gains 17% in MC1 accuracy (from 0.500 to 0.585) and 49% in MC2 accuracy (from 0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
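To make the two components concrete, the following is a minimal PyTorch sketch of how a factuality-aware DPO loss with label flipping and a binary-label margin could look. The function name `fdpo_loss`, the margin form `gamma * (f_chosen - f_rejected)`, and all argument names are illustrative assumptions; the abstract does not specify the exact formulation.

```python
import torch
import torch.nn.functional as F

def fdpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              chosen_factual, rejected_factual,
              beta=0.1, gamma=1.0):
    """Illustrative factuality-aware DPO loss (not the paper's exact formula).

    *_logps: summed log-probabilities of each response (batch-shaped tensors).
    chosen_factual / rejected_factual: binary factuality labels as float tensors.
    gamma scales the assumed factuality margin.
    """
    # (i) Label flipping: if the rejected response is factual but the chosen
    # one is not, swap the pair so the chosen side is never less factual.
    flip = rejected_factual > chosen_factual
    pc = torch.where(flip, policy_rejected_logps, policy_chosen_logps)
    pr = torch.where(flip, policy_chosen_logps, policy_rejected_logps)
    rc = torch.where(flip, ref_rejected_logps, ref_chosen_logps)
    rr = torch.where(flip, ref_chosen_logps, ref_rejected_logps)
    fc = torch.maximum(chosen_factual, rejected_factual)
    fr = torch.minimum(chosen_factual, rejected_factual)

    # Implicit reward difference, exactly as in standard DPO.
    logits = (pc - rc) - (pr - rr)

    # (ii) Factuality-aware margin (assumed form): positive only when the pair
    # differs in factuality, so the loss reduces to standard DPO when both
    # responses share the same label.
    margin = gamma * (fc - fr)

    return -F.logsigmoid(beta * logits - margin).mean()
```

Subtracting a positive margin inside the log-sigmoid requires a larger implicit-reward gap before the loss is satisfied, which is one way to emphasize pairs with a clear correctness difference while leaving same-factuality pairs under the standard DPO objective.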