Handwritten Mathematical Expression Recognition (HMER) methods have made remarkable progress, with most existing approaches built on either a hybrid CNN/RNN architecture with a GRU-based decoder or a Transformer architecture, each with its own strengths and weaknesses. Treating different model structures as complementary viewers and effectively integrating their diverse capabilities presents an intriguing avenue for exploration. This raises two key challenges: 1) how to fuse the two methods effectively, and 2) how to achieve higher performance at an appropriate level of complexity. This paper proposes an efficient CNN-Transformer multi-viewer, multi-task approach to enhance recognition performance. Our MMHMER model achieves 63.96%, 62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19, outperforming Posformer by absolute gains of 1.28%, 1.48%, and 0.58%. Our main contribution is a new multi-view, multi-task framework that effectively integrates the strengths of CNNs and Transformers: by combining the feature extraction capability of the CNN with the sequence modeling capability of the Transformer, the model better handles the complexity of handwritten mathematical expressions.
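To make the multi-view idea concrete, the following minimal PyTorch sketch pairs a GRU view and a Transformer view over a shared CNN feature map and fuses their token logits with a learnable weight. All names here (MultiViewHMER, the two-layer backbone, the fusion scalar alpha) are illustrative assumptions for exposition, not the MMHMER implementation.

```python
import torch
import torch.nn as nn


class MultiViewHMER(nn.Module):
    """Illustrative two-view recognizer: a shared CNN extractor feeds a GRU
    view and a Transformer view whose logits are fused. A sketch of the
    general CNN-Transformer multi-view idea, not the MMHMER architecture."""

    def __init__(self, vocab_size: int, d_model: int = 256, max_len: int = 256):
        super().__init__()
        # Shared CNN backbone (a stand-in for a deeper encoder such as DenseNet).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        # View 1: GRU decoder conditioned on pooled image context.
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.gru_head = nn.Linear(d_model, vocab_size)
        # View 2: Transformer decoder cross-attending to the CNN feature map.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerDecoder(layer, num_layers=3)
        self.tf_head = nn.Linear(d_model, vocab_size)
        # Learnable scalar balancing the two views' predictions.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, images: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
        # images: (B, 1, H, W); tgt_tokens: (B, T) teacher-forced LaTeX tokens.
        feats = self.backbone(images)              # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        tgt = self.embed(tgt_tokens) + self.pos[:, : tgt_tokens.size(1)]

        # GRU view: initial hidden state from globally pooled visual features.
        h0 = memory.mean(dim=1).unsqueeze(0)       # (1, B, C)
        gru_out, _ = self.gru(tgt, h0)
        logits_gru = self.gru_head(gru_out)

        # Transformer view: causal self-attention plus cross-attention to memory.
        causal = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        tf_out = self.transformer(tgt, memory, tgt_mask=causal)
        logits_tf = self.tf_head(tf_out)

        # Weighted fusion of the two views' logits.
        return self.alpha * logits_gru + (1 - self.alpha) * logits_tf


model = MultiViewHMER(vocab_size=100)  # placeholder vocabulary size
logits = model(torch.randn(2, 1, 64, 256), torch.randint(0, 100, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 100])
```

In this sketch the fusion happens at the logit level with a single learned scalar; a multi-task setup would instead supervise each view with its own loss in addition to the fused prediction, which is one plausible reading of the multi-task component described above.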