Mathematical notation makes up a large portion of STEM literature, yet finding semantic representations for formulae remains a challenging problem. Because mathematical notation is precise and its meaning can change significantly with small changes in its characters, methods that work for natural-language text do not necessarily work well for mathematical expressions. In this work, we describe an approach for representing mathematical expressions in a continuous vector space. We use the encoder of a sequence-to-sequence architecture, trained on visually different but mathematically equivalent expressions, to generate vector representations (embeddings). We compare this approach with an autoencoder and show that the former is better at capturing mathematical semantics. Finally, to expedite future research, we publish a corpus of equivalent transcendental and algebraic expression pairs.
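To make the notion of "visually different but mathematically equivalent" expressions concrete, the following is a minimal illustrative sketch in Python. It is not the method described in this work: the sample pairs, the string format, and the helper names `evaluate` and `numerically_equivalent` are assumptions made for illustration, and the published corpus may be structured differently. The check is numerical, so agreement at sampled points suggests equivalence but does not prove it.

```python
import math
import random

# Hypothetical sample pairs (algebraic and transcendental); illustrative only,
# not taken from the published corpus.
PAIRS = [
    ("sin(x)**2 + cos(x)**2", "1"),
    ("(x + 1)**2", "x**2 + 2*x + 1"),
    ("exp(x) * exp(-x)", "1"),
    ("log(exp(x))", "x"),
]

# Only these names are visible to the evaluated expression strings.
SAFE_NAMES = {"sin": math.sin, "cos": math.cos, "exp": math.exp, "log": math.log}


def evaluate(expr, x):
    """Evaluate an expression string at the point x in a restricted namespace."""
    return eval(expr, {"__builtins__": {}}, {**SAFE_NAMES, "x": x})


def numerically_equivalent(lhs, rhs, trials=20, tol=1e-9, seed=0):
    """Return True if lhs and rhs agree at `trials` random points in [-2, 2]."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-2.0, 2.0)
        if not math.isclose(evaluate(lhs, x), evaluate(rhs, x),
                            rel_tol=tol, abs_tol=tol):
            return False
    return True


if __name__ == "__main__":
    for lhs, rhs in PAIRS:
        print(f"{lhs!r} vs {rhs!r}: {numerically_equivalent(lhs, rhs)}")
```

Each pair differs substantially as a character sequence while denoting the same function, which is the property a semantic embedding would need to capture and a purely surface-level representation would tend to miss.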