Scholarly articles in mathematical fields feature mathematical statements such as theorems and propositions, as well as their proofs. Extracting them from the PDF representation of the articles requires understanding scientific text along with visual and font-based indicators. We pose this as a multimodal classification problem, using text, font features, and bitmap image renderings of the PDF as the different modalities. In this paper we propose a multimodal machine learning approach for the extraction of theorem-like environments and proofs, based on late fusion of features extracted by individual unimodal classifiers and taking into account the sequential succession of blocks in the document. For the text modality, we pretrain a new language model on an 11 GB scientific corpus; experiments show performance on our task similar to that of a model (RoBERTa) pretrained on 160 GB, with faster convergence and much less fine-tuning data required. Font-based information is captured by training a 128-cell LSTM on the sequence of font names and sizes within each block. Bitmap renderings are handled by an EfficientNetV2 deep network tuned to classify each image block. Finally, a simple CRF-based approach combines the features of the multimodal model with information on block sequences. Experimental results show the benefits of a multimodal approach over any single modality, as well as major performance improvements from CRF modeling of block sequences.
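The pipeline described above (late fusion of unimodal features, followed by sequence modeling over blocks) can be illustrated with a minimal sketch. This is not the authors' implementation: the label set, the function names `late_fusion`, `emission_scores`, and `viterbi`, and all numeric scores below are hypothetical, and the sketch only shows Viterbi decoding for a linear-chain model with given scores, not CRF training.

```python
# Minimal sketch, pure standard library. All names and numbers are
# illustrative assumptions, not taken from the paper.

LABELS = ["basic", "theorem", "proof"]  # hypothetical label set


def late_fusion(text_feats, font_feats, image_feats):
    """Concatenate the per-modality feature vectors of one block."""
    return list(text_feats) + list(font_feats) + list(image_feats)


def emission_scores(fused, weights, bias):
    """Linear layer: one score per label from the fused feature vector."""
    return [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]


def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : score of label j for block t
    transitions[i][j] : score of label i being followed by label j
    """
    n_labels = len(transitions)
    best = list(emissions[0])  # best path score ending in each label
    backptrs = []
    for t in range(1, len(emissions)):
        new_best, ptrs = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            i_star = max(
                range(n_labels), key=lambda i: best[i] + transitions[i][j]
            )
            ptrs.append(i_star)
            new_best.append(
                best[i_star] + transitions[i_star][j] + emissions[t][j]
            )
        best = new_best
        backptrs.append(ptrs)
    # Follow back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda j: best[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path


# Four consecutive blocks; the last one is ambiguous when viewed alone.
emissions = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.6, 0.0, 0.5],
]
# Hypothetical transition scores that reward staying in the same label.
transitions = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

independent = [max(range(3), key=lambda j: e[j]) for e in emissions]
sequential = viterbi(emissions, transitions)
print([LABELS[k] for k in independent])
print([LABELS[k] for k in sequential])
```

In this toy input, classifying each block independently labels the last block "basic", while sequence decoding keeps it inside the preceding "proof", which is the kind of effect that modeling block sequences is meant to capture.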