Although large language models (LLMs) have made remarkable progress, current preference optimization methods still struggle to enforce directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, and that DGPO delivers further consistent gains across multiple datasets and model families, with average accuracy improvements of up to 3.6%.
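To make the idea of a group-wise, margin-based likelihood objective concrete, the following is a minimal sketch of one possible reading, not the DGPO objective itself, which the abstract does not spell out. It assumes each group holds sequence log-likelihoods for coherent candidates (forward and reverse) and for inconsistent ones, and it contrasts the two sets with a softmax-style loss. The function name `groupwise_margin_loss` and the parameters `margin` and `beta` are hypothetical.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def groupwise_margin_loss(
    coherent_logps: Sequence[float],
    inconsistent_logps: Sequence[float],
    margin: float = 1.0,  # hypothetical hyperparameter, not from the paper
    beta: float = 1.0,    # hypothetical scaling of log-likelihoods
) -> float:
    """Illustrative group-level contrastive loss (an assumption, not DGPO).

    Computes -log( sum_pos exp(beta * lp)
                   / (sum_pos exp(beta * lp) + sum_neg exp(beta * ln + margin)) ).
    The loss approaches 0 when every coherent candidate in the group is
    much more likely than every inconsistent one, even after the margin
    is added to the inconsistent scores.
    """
    if not coherent_logps or not inconsistent_logps:
        raise ValueError("each group needs at least one candidate of each kind")
    pos = [beta * lp for lp in coherent_logps]
    neg = [beta * lp + margin for lp in inconsistent_logps]
    return logsumexp(pos + neg) - logsumexp(pos)


# Usage: one group mixing forward and reverse candidates (made-up numbers).
group = {"coherent": [-4.1, -4.8], "inconsistent": [-9.5, -11.0, -8.7]}
loss = groupwise_margin_loss(group["coherent"], group["inconsistent"])
print(f"group loss: {loss:.4f}")
```

Unlike a pairwise objective, this sketch scores all candidates in a group jointly, so each coherent path is compared with every inconsistent alternative in a single normalization.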