Recent advances in visual language models (VLMs) have transformed end-to-end document understanding, but their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which feature dense reference hierarchies and extensive marginal annotations. We introduce two new resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practice. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when they are confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
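For readers unfamiliar with the headline metric, the following is a minimal sketch of how a median Character Error Rate over a set of pages is commonly computed: the Levenshtein edit distance between the reference transcription and the model output, divided by the reference length, with the median taken across pages. The function names (`levenshtein`, `cer`, `median_cer`) and the NFC normalization step are illustrative assumptions, not the paper's evaluation code; the abstract does not specify how polytonic Greek diacritics are normalized before scoring.

```python
import unicodedata
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length.

    Assumption: both strings are NFC-normalized first, so that precomposed
    and decomposed Greek diacritics compare as equal.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def median_cer(pairs) -> float:
    """Median CER over an iterable of (reference, hypothesis) page pairs."""
    return median(cer(ref, hyp) for ref, hyp in pairs)
```

Under this definition, a median CER of 1.0% means that at least half of the evaluated pages needed at most about one character edit per hundred reference characters. Reporting the median rather than the mean makes the figure less sensitive to a few badly recognized pages.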