Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted for this task, directly predicting the LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, and may fail to describe the positional and hierarchical relationships among symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks, expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models a mathematical expression as a forest structure and parses the relative positional relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms state-of-the-art methods by 2.03%/1.22%/2.00%, 1.83%, and 4.62% on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.
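To make the position-forest idea concrete, the following is a toy sketch of how per-symbol position identifiers might be derived from a LaTeX-like token stream. The encoding scheme (`M` for the main line, `U`/`D` for superscript/subscript branches, paths joined with `-`) and the function name are illustrative assumptions for exposition, not PosFormer's actual implementation.

```python
def assign_positions(tokens):
    """Toy position-forest labeling (illustrative, not PosFormer's scheme):
    assign each symbol a path string, where 'M' marks the main line and
    'U'/'D' mark superscript/subscript branches; nested branches
    concatenate, e.g. 'M-U' for a symbol inside one superscript."""
    positions = []   # (symbol, path) pairs, in reading order
    path = ["M"]     # current path from the forest root
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("^", "_") and i + 1 < len(tokens) and tokens[i + 1] == "{":
            # Enter a superscript ('U') or subscript ('D') branch.
            path.append("U" if tok == "^" else "D")
            i += 2  # skip the operator and its opening brace
        elif tok == "}":
            path.pop()  # close the current branch
            i += 1
        else:
            positions.append((tok, "-".join(path)))
            i += 1
    return positions

# 'x ^ { 2 }': x sits on the main line, 2 inside a superscript branch.
print(assign_positions(["x", "^", "{", "2", "}"]))
# → [('x', 'M'), ('2', 'M-U')]
```

Note how no extra annotation is required: the identifiers are derived purely from the structure already present in the LaTeX sequence, which is the property the abstract highlights.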