This paper presents a specialized methodology for digitizing and segmenting mathematical documents from zbMATH Open, a comprehensive database of mathematical literature, so that they can be processed by machine. Currently, approximately 831,000 documents exist only in scanned volumes and are therefore not machine-processable. Furthermore, a document often spans multiple pages or shares a page with other documents, and the scans mix diverse typesetting techniques, all of which complicates automated processing. To address these issues, we evaluate various Optical Character Recognition (OCR) tools and document separation techniques and propose an optimized pipeline that outperforms existing approaches. Our study identifies Mathpix as the most effective OCR tool for LaTeX conversion, based on BLEU and Edit Distance metrics. For document separation, we fine-tune generative Large Language Models (LLMs) and integrate them into a Majority Voting framework, achieving 97.5% accuracy when providing the text of the document. Additionally, our method identifies the start and end indexes for 90.6% of the test dataset, with an accuracy of 98.4% on those applicable cases, resulting in an overall accuracy of 89.1% on the entire dataset. This approach surpasses the baselines we compare against, including regular expressions, ChatGPT-4o, and computer vision-based techniques. As a practical outcome, we process 810,977 mathematical documents into machine-readable text and extract precise document boundaries for 721,288 documents in LaTeX format. These contributions make the documents substantially more accessible to mathematical information retrieval systems, machine learning models, and related applications.
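To make the voting step and the accuracy figures concrete, the following is a minimal sketch rather than the pipeline itself. It assumes each fine-tuned model returns a (start, end) index pair for a document and that a boundary is accepted only when a strict majority of models agree; the names `majority_vote` and `overall_accuracy` and the sample index values are hypothetical. The second function shows how coverage and accuracy on covered cases combine into an overall figure.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the prediction shared by a strict majority of models, else None.

    Assumption: ties and split votes count as 'no answer', which is one
    plausible reason a method would cover only part of a dataset.
    """
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None


def overall_accuracy(coverage: float, accuracy_on_covered: float) -> float:
    """Fraction of the whole dataset answered correctly.

    Uncovered documents are counted as not correctly handled.
    """
    return coverage * accuracy_on_covered


# Hypothetical (start, end) character indexes proposed by three models.
votes = [(120, 845), (120, 845), (118, 845)]
boundary = majority_vote(votes)

# Coverage and accuracy values taken from the abstract.
overall = overall_accuracy(0.906, 0.984)
print(boundary, round(overall, 3))
```

With the rounded inputs above, the product is about 0.89, which agrees with the reported overall accuracy of 89.1% up to rounding of the two percentages.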