Mathematical language in scientific communication and educational settings is important yet understudied relative to natural language. Recent work on mathematical language focuses either on representing stand-alone mathematical expressions, especially in their natural tree format, or on mathematical reasoning in pre-trained natural language models. Existing work on jointly modeling and generating natural and mathematical language simply treats mathematical expressions as text, without accounting for their rigid structural properties. In this paper, we propose a series of modifications to existing language models to jointly represent and generate text and math: representing mathematical expressions as sequences of node tokens in their operator tree format, using math symbol and tree position embeddings to preserve the semantic and structural properties of mathematical expressions, and using a constrained decoding method to generate mathematically valid expressions. We ground our modifications in GPT-2, resulting in a model we call MathGPT, and demonstrate that it outperforms baselines on mathematical expression generation tasks.
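To make the idea of "sequences of node tokens in their operator tree format" concrete, the following is a minimal sketch and not the MathGPT implementation. It assumes a hypothetical `Node` class, a depth-first (pre-order) traversal, and a tree position encoded as the path of child indices from the root; the abstract does not specify the traversal order or the position encoding, so these are illustrative choices only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One node of an operator tree (hypothetical structure, for illustration)."""
    symbol: str                      # e.g. "+", "x", "2"
    kind: str                        # assumed node types: "op", "var", "num"
    children: List["Node"] = field(default_factory=list)


def linearize(root: Node) -> List[Tuple[str, str, Tuple[int, ...]]]:
    """Walk the tree depth-first and return (symbol, kind, tree position) triples.

    The tree position is the path of child indices from the root: the root
    is (), its first child is (0,), and so on. This encoding is an assumption.
    """
    out: List[Tuple[str, str, Tuple[int, ...]]] = []

    def visit(node: Node, path: Tuple[int, ...]) -> None:
        out.append((node.symbol, node.kind, path))
        for i, child in enumerate(node.children):
            visit(child, path + (i,))

    visit(root, ())
    return out


# Operator tree for the expression x + 2 * y
expr = Node("+", "op", [
    Node("x", "var"),
    Node("*", "op", [Node("2", "num"), Node("y", "var")]),
])

tokens = linearize(expr)
for symbol, kind, path in tokens:
    print(symbol, kind, path)
```

In a model along these lines, each triple would presumably be mapped to a symbol embedding plus a position embedding derived from the path, so that structurally distinct occurrences of the same symbol receive different inputs.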