Direct Preference Optimization (DPO) is an effective and widely adopted approach for offline alignment but is poorly matched to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance regime, sequence-level DPO spreads gradient mass across non-critical serialization tokens (gradient dilution) and can reduce likelihood on rare, under-confident preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from validation-set SFT predictions, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to under-confident schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics over SFT by 11.59% on average, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and surpasses leading frontier models by 14.71%, while delivering strong gains in textual grounding.
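As a rough illustration of the idea summarized above, the following minimal sketch combines a standard sequence-level DPO loss with a confidence-gated penalty on schema tokens. It is not the TAB-PO objective itself, whose exact form the abstract does not give. The function names, the hard threshold `tau`, the barrier weight `lam`, the mean-NLL form of the barrier, and the toy probabilities are all assumptions made for illustration.

```python
import math


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Sequence-level DPO loss from per-token log-probabilities.

    pol_* are policy log-probs and ref_* are frozen reference log-probs;
    *_w is the preferred output and *_l the rejected one.
    """
    margin = beta * ((sum(pol_w) - sum(ref_w)) - (sum(pol_l) - sum(ref_l)))
    # -log(sigmoid(margin)), written as softplus(-margin).
    return math.log1p(math.exp(-margin))


def gated_barrier(pol_w, schema_mask, tau=0.5):
    """Hypothetical barrier: mean NLL over preferred schema tokens whose
    policy probability is below tau (assumed gating rule, not the paper's)."""
    gated = [
        -lp
        for lp, is_schema in zip(pol_w, schema_mask)
        if is_schema and math.exp(lp) < tau
    ]
    return sum(gated) / len(gated) if gated else 0.0


def tab_po_sketch(pol_w, pol_l, ref_w, ref_l, schema_mask,
                  beta=0.1, tau=0.5, lam=1.0):
    """Assumed combination: DPO loss plus a weighted gated barrier."""
    return (dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
            + lam * gated_barrier(pol_w, schema_mask, tau))


# Toy pair whose outputs differ only in the second token (a schema label).
pol_w = [math.log(0.9), math.log(0.2), math.log(0.8)]  # preferred
pol_l = [math.log(0.9), math.log(0.6), math.log(0.8)]  # rejected
schema_mask = [False, True, True]  # which preferred tokens define the schema

# The reference equals the policy here, as it would right after SFT.
loss = tab_po_sketch(pol_w, pol_l, pol_w, pol_l, schema_mask)
```

In this toy case only the second token is both schema-defining and under-confident, so it alone contributes to the barrier; the confident schema token and the serialization token are left to the DPO term.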