We investigate page ordering on 5,461 shuffled WOO documents (Dutch freedom of information releases) using page embeddings. These documents are heterogeneous collections, such as emails, legal texts, and spreadsheets, compiled into single PDFs, so semantic ordering signals are unreliable. We compare five methods, including pointer networks, seq2seq transformers, and specialized pairwise ranking models. The best-performing approach successfully reorders documents of up to 15 pages, with Kendall's tau ranging from 0.95 for short documents (2-5 pages) to 0.72 for 15-page documents. We observe two unexpected failures: seq2seq transformers fail to generalize to long documents (Kendall's tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long documents. Ablation studies suggest that learned positional encodings are one contributing factor in the seq2seq failure, but the degradation persists across all encoding variants, indicating multiple interacting causes. Attention pattern analysis reveals that short and long documents require fundamentally different ordering strategies, which explains why curriculum learning fails. Model specialization yields substantial improvements on longer documents (+0.21 tau).
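All results above are reported as Kendall's tau between the predicted and the true page order. The following is a minimal sketch of that metric, assuming the tie-free form (tau-a) applied to two permutations of the same pages; the abstract does not state which variant or implementation was used, and the function name `kendall_tau` is illustrative.

```python
from itertools import combinations


def kendall_tau(predicted_order, true_order):
    """Kendall's tau (tau-a, no ties) between two orderings of the same pages.

    Assumption: each page identifier appears exactly once in both orderings.
    Returns 1.0 for identical orderings and -1.0 for a fully reversed one.
    """
    if sorted(predicted_order) != sorted(true_order):
        raise ValueError("both orderings must contain the same pages")
    n = len(true_order)
    if n < 2:
        raise ValueError("need at least two pages")

    # Position of each page in the true ordering.
    true_pos = {page: i for i, page in enumerate(true_order)}
    # True positions, read off in the predicted order.
    ranks = [true_pos[page] for page in predicted_order]

    # A pair is concordant when the predicted order agrees with the true order.
    concordant = sum(1 for a, b in combinations(ranks, 2) if a < b)
    total = n * (n - 1) // 2
    discordant = total - concordant
    return (concordant - discordant) / total


if __name__ == "__main__":
    # Swapping the first two pages of a 4-page document flips one of six pairs.
    print(kendall_tau([1, 0, 2, 3], [0, 1, 2, 3]))
```

On this scale, a score near 0 (such as the 0.014 reported for seq2seq transformers on 21-25 pages) means the predicted order agrees with the true order on about as many page pairs as it disagrees, which is what a random shuffle would produce on average.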