Reading order detection is the foundation of document understanding. Most existing methods rely on uniform supervision, implicitly assuming that difficulty is constant across layout regions. In this work, we challenge this assumption by revealing a critical flaw, \textbf{Positional Disparity}: models master the deterministic start and end regions but suffer a performance collapse in the complex intermediate sections. This degradation arises because standard training lets the massive volume of easy patterns drown out the learning signals from difficult layouts. To address this, we propose \textbf{FocalOrder}, a framework driven by \textbf{Focal Preference Optimization (FPO)}. Specifically, FocalOrder employs adaptive difficulty discovery with an exponential moving average (EMA) mechanism to dynamically pinpoint hard-to-learn transitions, and introduces a difficulty-calibrated pairwise ranking objective to enforce global logical consistency. Extensive experiments show that FocalOrder establishes new state-of-the-art results on OmniDocBench v1.0 and Comp-HRDoc. Our compact model not only outperforms competitive specialized baselines but also significantly surpasses large-scale general vision-language models (VLMs). These results indicate that aligning the optimization with the intrinsic structural ambiguity of documents is critical for mastering complex document structures.
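The abstract names two ingredients, EMA-based difficulty discovery and a difficulty-calibrated pairwise ranking objective, without giving their formulas. The following is only a minimal sketch of how such a combination might look, not the FocalOrder implementation. Everything in it is an assumption made for illustration: the function names, the `decay` and `gamma` parameters, the power-law mapping from difficulty to weights, and the logistic pairwise loss.

```python
import math


def update_difficulty(ema, losses, decay=0.9):
    """Blend the newest per-position losses into a running EMA (hypothetical helper)."""
    return [decay * e + (1.0 - decay) * l for e, l in zip(ema, losses)]


def focal_weights(ema, gamma=2.0):
    """Turn EMA difficulty into weights with mean 1; harder positions get larger weights.

    The power-law form (difficulty ** gamma) is an illustrative assumption.
    """
    raw = [e ** gamma for e in ema]
    total = sum(raw)
    if total == 0.0:
        return [1.0] * len(ema)
    return [len(raw) * r / total for r in raw]


def weighted_pairwise_loss(scores, order, weights):
    """Difficulty-weighted logistic ranking loss (assumed form).

    scores[i]  : model score of region i (lower score = read earlier)
    order      : ground-truth reading order as a list of region indices
    weights[k] : weight attached to reading position k
    """
    total, norm = 0.0, 0.0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            first, later = order[a], order[b]
            # Positive margin means the pair is already ranked correctly.
            margin = scores[later] - scores[first]
            # Average the weights of the two positions involved in the pair.
            w = 0.5 * (weights[a] + weights[b])
            total += w * math.log1p(math.exp(-margin))
            norm += w
    return total / norm if norm else 0.0


# Usage: position 1 (a "middle" transition) has just produced a large loss,
# so its EMA difficulty and its weight rise relative to the easy positions.
ema = update_difficulty([0.1, 0.1, 0.1], [0.1, 2.0, 0.1])
weights = focal_weights(ema)
loss = weighted_pairwise_loss([0.0, 1.0, 2.0], [0, 1, 2], weights)
```

In this sketch the EMA smooths noisy per-step losses so that a position is only treated as hard when it stays hard over many updates, and normalizing the weights to mean 1 keeps the overall loss scale comparable to uniform supervision.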