Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased relative to high-quality human feedback. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.
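For concreteness, the following is a minimal sketch of the kind of debiased objective DDPO describes, written in the style of a doubly robust (prediction-powered) estimator; the per-example DPO loss $\ell_\theta$, the density-ratio weight $\hat w$, and the label symbols $\tilde z$ (AI) and $z$ (human) are illustrative assumptions, not the paper's exact estimator:
\[
\widehat{\mathcal{L}}_{\mathrm{DDPO}}(\theta)
= \frac{1}{N}\sum_{i=1}^{N} \ell_\theta\big(x_i, y_i^{w}, y_i^{l};\, \tilde z_i\big)
+ \frac{1}{n}\sum_{j=1}^{n} \hat w\big(x_j, y_j^{w}, y_j^{l}\big)
\Big[\ell_\theta\big(x_j, y_j^{w}, y_j^{l};\, z_j\big) - \ell_\theta\big(x_j, y_j^{w}, y_j^{l};\, \tilde z_j\big)\Big],
\]
where the first sum runs over the $N$ AI-labeled preference pairs, the second (residual) sum over the $n \ll N$ human-labeled pairs, and $\hat w$ is an estimated density ratio that aligns the human-labeled subset's prompt-response distribution with that of the AI-labeled data. Under such a construction, systematic AI-label bias cancels in expectation on the human-labeled subset, while the large AI-labeled sample continues to drive variance reduction.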