Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.