Citation-based Information Retrieval (IR) methods for scientific documents have proven effective for IR applications, such as plagiarism detection or literature recommender systems, in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often use mathematical concepts, expressed in formula notation, to refer to prior knowledge. Our long-term goal is to generalize citation-based IR methods so that they apply to both classical references and mathematical concepts. In this paper, we suggest how mathematical formulas could be cited and define a Formula Concept Retrieval task with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). FCD aims to define and explore a 'Formula Concept' that names a bundle of equivalent representations of a formula, whereas FCR is designed to match a given formula to a previously assigned unique mathematical concept identifier. We present machine learning-based approaches to the FCD and FCR tasks and evaluate them on a standardized test collection (the NTCIR arXiv dataset). Our FCD approach yields a precision of 68% for retrieving equivalent representations of frequent formulas and a recall of 72% for extracting the formula name from the surrounding text. FCD and FCR enable the citation of formulas within mathematical documents and facilitate semantic search and question answering, as well as document similarity assessments for plagiarism detection or recommender systems.