Recent advances in natural language processing (NLP), particularly the emergence of large language models (LLMs), have significantly enhanced the field of textual analysis. However, while these developments have yielded substantial progress in analyzing natural language text, efforts to analyze mathematical equations and their relationships within texts have produced mixed results. This paper takes initial steps toward expanding the problem of relation extraction to understanding the dependency relationships between mathematical expressions in STEM articles. The authors construct the Mathematical Derivation Graphs Dataset (MDGD), sourced from a random sampling of the arXiv corpus, containing an analysis of $107$ published STEM manuscripts with over $2000$ manually labeled inter-equation dependency relationships. These relationships yield a new object, referred to as a derivation graph, that summarizes the mathematical content of each manuscript. The authors systematically evaluate analytical and machine learning (ML) based models on their ability to identify and extract the derivation relationships in each article, comparing the results with the ground truth. They show that the best tested LLMs achieve $F_1$ scores of $\sim 45\%$--$52\%$, and attempt to improve performance by combining the LLMs with analytic algorithms and other methods.
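To make the evaluation setup concrete, the following is a minimal sketch of edge-level $F_1$ scoring between a predicted derivation graph and a manually labeled ground truth. The edge representation (directed pairs of equation labels) and the scoring granularity are assumptions for illustration, not the paper's exact protocol.

```python
def edge_f1(predicted, gold):
    """Precision, recall, and F1 over directed (source_eq, derived_eq) edges.

    Each edge (a, b) means: equation b is derived from equation a.
    This edge-set formulation is an assumption for this sketch.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # correctly predicted derivation edges
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: equations labeled 1-4; the model misses the
# true edge (2, 3) and hallucinates (2, 4).
gold_edges = [(1, 2), (2, 3), (3, 4)]
pred_edges = [(1, 2), (2, 4), (3, 4)]
p, r, f1 = edge_f1(pred_edges, gold_edges)
```

Because the predicted and gold sets here are the same size, precision and recall coincide ($2/3$ each), giving $F_1 = 2/3$; in general the harmonic mean penalizes an imbalance between spurious and missed edges.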