We address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, using text, font features, and bitmap renderings of PDF pages as distinct modalities. We propose a modular, sequential multimodal machine learning approach specifically designed for extracting theorem-like environments and proofs. The approach builds on a cross-modal attention mechanism that generates multimodal paragraph embeddings, which are then fed into our novel multimodal sliding window transformer architecture to capture sequential information across paragraphs. Our document AI methodology stands out in that it requires no OCR preprocessing, no LaTeX sources at inference time, and no custom pre-training with specialized losses to learn cross-modality relationships. Unlike many conventional approaches that operate at the single-page level, ours applies directly to multi-page PDFs and seamlessly handles the page breaks common in lengthy scientific mathematical documents. Our experiments demonstrate the performance gains obtained by moving from unimodal to multimodal inputs, and then by adding sequential modeling over paragraphs.
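The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all dimensions, layer counts, window size, class count, and the mean-pooling fusion step are hypothetical choices made for the sketch. It shows (1) cross-modal attention fusing per-paragraph text, font, and image embeddings into one multimodal paragraph embedding, and (2) a transformer encoder slid over windows of consecutive paragraph embeddings to classify each paragraph (e.g. as theorem, proof, or other).

```python
import torch
import torch.nn as nn

class CrossModalParagraphEncoder(nn.Module):
    """Fuse per-paragraph text, font, and image embeddings via cross-modal
    attention. Dimensions and pooling are illustrative assumptions."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, font, image):
        # Treat the three modality embeddings as a length-3 sequence
        # per paragraph, so each modality can attend to the others.
        modalities = torch.stack([text, font, image], dim=1)   # (B, 3, dim)
        fused, _ = self.attn(modalities, modalities, modalities)
        # Pool the attended modalities into one paragraph embedding.
        return self.norm(fused.mean(dim=1))                    # (B, dim)

class SlidingWindowClassifier(nn.Module):
    """Run a transformer encoder over a sliding window of consecutive
    paragraph embeddings and classify the centre paragraph of each window."""
    def __init__(self, dim=64, window=5, n_classes=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)
        self.window = window

    def forward(self, paragraphs):                  # (B, n_para, dim)
        half = self.window // 2
        # Zero-pad the paragraph sequence so every paragraph gets a
        # full window of context, even at document boundaries.
        padded = nn.functional.pad(paragraphs, (0, 0, half, half))
        logits = []
        for i in range(paragraphs.size(1)):
            ctx = self.encoder(padded[:, i:i + self.window])
            logits.append(self.head(ctx[:, half]))  # centre paragraph
        return torch.stack(logits, dim=1)           # (B, n_para, n_classes)
```

Because the window slides over paragraph embeddings rather than page images, paragraphs whose context spans a page break are handled the same as any others, matching the multi-page property claimed in the abstract.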