Mathematical notation makes up a large portion of STEM literature, yet finding semantic representations for formulae remains a challenging problem. Because mathematical notation is precise and a change of even a single character can significantly alter an expression's meaning, methods that work for natural-language text do not necessarily work well for mathematical expressions. This work describes an approach for representing mathematical expressions in a continuous vector space. We use the encoder of a sequence-to-sequence architecture, trained on pairs of visually different but mathematically equivalent expressions, to generate vector representations (embeddings). We compare this approach with a structural approach that embeds an expression based on its visual layout, and show that the proposed approach is better at capturing mathematical semantics. Finally, to expedite future research, we publish a corpus of equivalent transcendental and algebraic expression pairs.
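To make the notion of "visually different but mathematically equivalent" concrete, the following is a minimal sketch that checks candidate expression pairs numerically at random sample points. It is an illustration only and not the procedure used to build the published corpus; the helper name `numerically_equivalent`, the sampling range, and the tolerance are all assumptions chosen for the example. Agreement at sampled points is evidence of equivalence, not a proof.

```python
import math
import random


def numerically_equivalent(f, g, trials=200, lo=-3.0, hi=3.0, tol=1e-9, seed=0):
    """Return True if f and g agree (within tol) at `trials` random points.

    Hypothetical helper for illustration: sampling can only suggest
    equivalence, it cannot prove it.
    """
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        if not math.isclose(f(x), g(x), rel_tol=tol, abs_tol=tol):
            return False
    return True


# Candidate pairs: (left text, right text, left function, right function).
candidate_pairs = [
    # Transcendental pair: double-angle identity.
    ("sin(2x)", "2 sin(x) cos(x)",
     lambda x: math.sin(2 * x), lambda x: 2 * math.sin(x) * math.cos(x)),
    # Algebraic pair: expanded square.
    ("(x + 1)^2", "x^2 + 2x + 1",
     lambda x: (x + 1) ** 2, lambda x: x ** 2 + 2 * x + 1),
    # Visually close to the first pair, but not equivalent.
    ("sin(2x)", "2 sin(x)",
     lambda x: math.sin(2 * x), lambda x: 2 * math.sin(x)),
]

for left, right, f, g in candidate_pairs:
    verdict = "equivalent" if numerically_equivalent(f, g) else "different"
    print(f"{left}  vs  {right}: {verdict}")
```

The third pair shows why character-level similarity is a poor proxy for meaning: it differs from the first pair by only a few characters, yet the two expressions are not equivalent.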