Citation-based information retrieval (IR) methods for scientific documents have proven effective for IR applications such as plagiarism detection and literature recommender systems in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often employ mathematical concepts through formula notation to refer to prior knowledge. Our long-term goal is to generalize citation-based IR methods and apply the generalized method to both classical references and mathematical concepts. In this paper, we suggest how mathematical formulas could be cited and define a Formula Concept Retrieval task with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). FCD aims to define and explore a 'Formula Concept', a name for a bundle of equivalent representations of a formula, whereas FCR matches a given formula to a previously assigned unique mathematical concept identifier. We present machine learning-based approaches to the FCD and FCR tasks and evaluate them on a standardized test collection, the NTCIR arXiv dataset. Our FCD approach yields a precision of 68% for retrieving equivalent representations of frequent formulas and a recall of 72% for extracting the formula name from the surrounding text. FCD and FCR enable the citation of formulas within mathematical documents and facilitate semantic search and question answering, as well as document similarity assessments for plagiarism detection or recommender systems.
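To make the input and output of the FCR subtask concrete, the following is a minimal sketch of what matching a formula to a concept identifier could look like. It is not the machine learning-based approach evaluated in the paper: it replaces learning with an exact lookup over whitespace-normalized strings. The concept identifiers, the example formulas, and the names `FORMULA_CONCEPTS`, `normalize`, `build_index`, and `recognize` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical toy data (an assumption, not taken from the paper): each
# Formula Concept bundles equivalent representations of one formula under
# a name and an illustrative identifier.
FORMULA_CONCEPTS = {
    "concept:mass-energy-equivalence": {
        "name": "mass-energy equivalence",
        "representations": ["E = m c^2", "E = c^2 m", "m = E / c^2"],
    },
    "concept:pythagorean-theorem": {
        "name": "Pythagorean theorem",
        "representations": ["a^2 + b^2 = c^2", "c^2 = a^2 + b^2"],
    },
}


def normalize(formula: str) -> str:
    """Crude normalization for this sketch: remove all whitespace."""
    return "".join(formula.split())


def build_index(concepts: dict) -> Dict[str, str]:
    """Map every normalized representation to its concept identifier."""
    index = {}
    for concept_id, concept in concepts.items():
        for representation in concept["representations"]:
            index[normalize(representation)] = concept_id
    return index


def recognize(formula: str, index: Dict[str, str]) -> Optional[str]:
    """Toy FCR step: return the concept identifier, or None if unknown."""
    return index.get(normalize(formula))


INDEX = build_index(FORMULA_CONCEPTS)
```

Under these assumptions, `recognize("E=mc^2", INDEX)` returns the identifier of the mass-energy equivalence concept, while a formula that is absent from the toy data returns `None`. An exact lookup cannot handle renamed variables or rearrangements that are not listed explicitly, which is the kind of variation a learned approach would need to cover.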