In this paper, we fine-tuned three pre-trained BERT models on the task of "definition extraction" from mathematical English written in LaTeX. We frame this as a binary classification problem: a sentence either contains a definition of a mathematical term or it does not. We used two original datasets, "Chicago" and "TAC," to fine-tune and test these models. We also tested on WFMALL, a dataset presented by Vanetik and Litvak in 2021, and compared the performance of our models to theirs. We found that a Sentence-BERT transformer model performed best on overall accuracy, recall, and precision, achieving results comparable to the earlier models with less computational effort.
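To make the task setup concrete, the following is a minimal sketch of definition extraction as binary sentence classification with Sentence-BERT embeddings and a lightweight classifier head. It is not the paper's exact pipeline: the checkpoint name ("all-MiniLM-L6-v2") and the toy labeled sentences are assumptions for illustration only.

```python
# Sketch: definition extraction as binary classification over LaTeX sentences.
# Frozen Sentence-BERT embeddings + logistic regression (an assumed setup,
# not necessarily the authors' fine-tuning procedure).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical labeled data: 1 = sentence contains a definition, 0 = it does not.
train_sents = [
    r"A group $G$ is called \emph{abelian} if $ab = ba$ for all $a, b \in G$.",
    r"We now prove the main theorem of this section.",
]
train_labels = [1, 0]
test_sents = [
    r"A function $f$ is \emph{continuous} at $x_0$ if for every $\epsilon > 0$ ...",
    r"The proof proceeds by induction on $n$.",
]
test_labels = [1, 0]

# Encode LaTeX sentences with a pre-trained Sentence-BERT model
# ("all-MiniLM-L6-v2" is an assumed, commonly used checkpoint).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_train = encoder.encode(train_sents)
X_test = encoder.encode(test_sents)

# Fit a simple binary classifier on the sentence embeddings and report
# the accuracy/precision/recall metrics the paper evaluates on.
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print(classification_report(test_labels, clf.predict(X_test)))
```

Keeping the sentence encoder frozen and training only a small classifier, as sketched here, is one way such a model can match larger fine-tuned baselines at a fraction of the computational cost.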