Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias driven by non-semantic factors such as option positions and label symbols. Existing inference-time debiasing methods are costly and may harm reasoning, while pointwise training ignores the requirement that the same question should yield consistent answers across permutations of its options. To address these limitations, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) a cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) a consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
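The two mechanisms named above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the `bonus` weight, the majority-vote form of the consistency reward, and the optional standard-deviation normalization (borrowed from standard GRPO) are hypothetical and are not taken from the PA-GRPO code release.

```python
from collections import Counter
from itertools import permutations
from statistics import mean, pstdev
import random


def build_permutation_group(num_options, k, seed=0):
    """Sample k distinct option orderings; the original order is always included.

    Each ordering `perm` lists, for every displayed position, the index of the
    original option shown there.
    """
    rng = random.Random(seed)
    identity = tuple(range(num_options))
    others = [p for p in permutations(range(num_options)) if p != identity]
    rng.shuffle(others)
    return [identity] + others[: k - 1]


def to_canonical(predicted_position, perm):
    """Map the position chosen in a permuted prompt back to the original option index."""
    return perm[predicted_position]


def consistency_aware_rewards(canonical_choices, gold, bonus=0.5):
    """Assumed reward: 1 for a correct choice, plus a bonus for agreeing with the
    group's majority choice, scaled by how large that majority is."""
    majority, majority_count = Counter(canonical_choices).most_common(1)[0]
    agreement = majority_count / len(canonical_choices)
    rewards = []
    for choice in canonical_choices:
        reward = 1.0 if choice == gold else 0.0
        if choice == majority:
            reward += bonus * agreement
        rewards.append(reward)
    return rewards


def cross_permutation_advantages(rewards, normalize=False, eps=1e-6):
    """Advantage of each response relative to the mean reward over the whole
    permutation group. Dividing by the group standard deviation is an assumption."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        scale = pstdev(rewards) + eps
        advantages = [a / scale for a in advantages]
    return advantages


if __name__ == "__main__":
    # Hypothetical group of four orderings of a four-option question (gold = option 2).
    perms = [(0, 1, 2, 3), (2, 0, 1, 3), (1, 2, 0, 3), (3, 2, 1, 0)]
    predicted_positions = [2, 0, 0, 1]  # position picked by the model in each prompt
    choices = [to_canonical(p, perm) for p, perm in zip(predicted_positions, perms)]
    rewards = consistency_aware_rewards(choices, gold=2)
    print(choices, rewards, cross_permutation_advantages(rewards))
```

In this sketch, the response that follows position rather than content (the third permutation) falls below the group mean and receives a negative advantage, while the three responses that agree on the same underlying option are reinforced. Mapping each choice back to the original option index before comparing is what makes the consistency check semantic rather than positional.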