TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree

Direct Preference Optimization (DPO) is an effective and widely adopted approach for offline alignment but is poorly matched to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance regime, sequence-level DPO spreads gradient mass across non-critical serialization tokens (gradient dilution) and can reduce likelihood on rare, under-confident preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from validation-set SFT predictions, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to under-confident schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics over SFT by 11.59% on average, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and surpasses leading frontier models by 14.71%, while delivering strong gains in textual grounding.
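
To make the structure of the objective concrete, the following is a minimal sketch of a sequence-level DPO loss combined with a confidence-gated token-level anchor on schema-critical tokens. The abstract does not state the exact form of the barrier, so everything beyond the standard DPO term is an assumption for illustration: the function names, the hard probability threshold `tau`, the mixing weight `lam`, the averaging over gated tokens, and the binary `critical_mask` are hypothetical and are not taken from the paper.

```python
import math


def sequence_dpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """Standard sequence-level DPO loss from per-token log-probabilities.

    pol_w / ref_w: policy / reference log-probs of the preferred sequence.
    pol_l / ref_l: policy / reference log-probs of the rejected sequence.
    """
    margin = (sum(pol_w) - sum(ref_w)) - (sum(pol_l) - sum(ref_l))
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a numerically stable way.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def gated_barrier(pol_w, critical_mask, tau=0.5):
    """Hypothetical confidence-gated anchor (an assumption, not the paper's formula).

    Applies a supervised negative log-likelihood term only to preferred tokens
    that are marked schema-critical AND whose policy probability is below tau.
    """
    gated = [
        -lp
        for lp, is_critical in zip(pol_w, critical_mask)
        if is_critical and math.exp(lp) < tau
    ]
    return sum(gated) / len(gated) if gated else 0.0


def tab_style_loss(pol_w, ref_w, pol_l, ref_l, critical_mask,
                   beta=0.1, lam=1.0, tau=0.5):
    """Illustrative combination: DPO term plus lam times the gated anchor."""
    dpo = sequence_dpo_loss(pol_w, ref_w, pol_l, ref_l, beta)
    return dpo + lam * gated_barrier(pol_w, critical_mask, tau)


# Toy example: three preferred tokens, the last two are schema-critical.
pol_w = [math.log(0.9), math.log(0.2), math.log(0.9)]
ref_w = list(pol_w)            # policy equals reference, so the DPO margin is zero
pol_l = [math.log(0.5)] * 3
ref_l = list(pol_l)
mask = [False, True, True]
loss = tab_style_loss(pol_w, ref_w, pol_l, ref_l, mask)
```

In this toy case only the second token is both schema-critical and under-confident (probability 0.2, below `tau`), so it alone receives the anchoring term. The third token is critical but already confident, and the first is a non-critical serialization token, so neither contributes to the barrier.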
