Document parsing is now widely used in applications such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading on most element-wise metrics except figures and attaining high Normalized Edit Distance scores on the easy, medium, and hard subsets. We also observe performance variation by document type: models perform well on non-academic documents, while slide decks are the most challenging.