Mathematics is a highly specialized domain with its own unique challenges, and it has seen limited study in natural language processing. However, mathematics is used in a wide variety of fields, and multidisciplinary research often relies on an understanding of mathematical concepts. To aid researchers coming from other fields, we develop a prototype system for searching for and defining mathematical concepts in context, focusing on the field of category theory. This system, Parmesan, depends on natural language processing components including concept extraction, relation extraction, definition extraction, and entity linking. In developing this system, we show that existing techniques cannot be applied directly to the category theory domain, and we suggest hybrid techniques that do perform well, though we expect the system to evolve over time. We also provide two cleaned mathematical corpora that power the prototype system, based on journal articles and wiki pages, respectively. The corpora have been annotated with dependency trees, lemmas, and part-of-speech tags.