Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs, so that the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with a clear difference in correctness and reduces to standard DPO when both responses share the same factuality label. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B to 14B parameters), F-DPO consistently improves factuality and reduces hallucination rates relative to both the base models and standard DPO. On Qwen3-8B, F-DPO reduces the hallucination rate roughly 5x (from 0.424 to 0.084) while improving the factuality score by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B gains 17% in MC1 accuracy (0.500 to 0.585) and 49% in MC2 accuracy (0.357 to 0.531), both relative improvements. F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
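The two components described above, label flipping and a factuality-aware margin, can be illustrated with a minimal per-pair sketch. The abstract does not give the exact loss, so the additive form of the margin below is an assumption, as are the function names and the default values of `beta` and `margin`; this is not the authors' implementation.

```python
import math


def flip_if_misordered(chosen, rejected, chosen_factual, rejected_factual):
    """Swap a pair when the rejected response is factual and the chosen one is not.

    After this step the chosen response is never less factual than the rejected one.
    """
    if rejected_factual and not chosen_factual:
        return rejected, chosen, rejected_factual, chosen_factual
    return chosen, rejected, chosen_factual, rejected_factual


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def f_dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
               chosen_factual, rejected_factual, beta=0.1, margin=1.0):
    """Per-pair DPO loss with an assumed additive factuality margin.

    Expects a pair that has already been passed through flip_if_misordered.
    beta and margin are hypothetical values chosen for illustration.
    """
    delta = int(chosen_factual) - int(rejected_factual)  # 0 or 1 after flipping
    if delta < 0:
        raise ValueError("pair is misordered; apply flip_if_misordered first")
    # Standard DPO logit: scaled difference of policy-vs-reference log-ratios.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Assumption: the margin is subtracted only when factuality labels differ,
    # so the loss equals standard DPO when delta == 0.
    z = logits - margin * delta
    return softplus(-z)  # equals -log(sigmoid(z))
```

When the two factuality labels agree, `delta` is zero and the expression is the ordinary DPO loss, matching the reduction property stated in the abstract. When they differ, the policy has to separate the pair by an additional `margin` before the loss drops to the same level.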