Handwritten mathematical expression recognition (HMER) is a challenging image-to-text task due to the complex layouts of mathematical expressions, and it suffers from problems such as over-parsing and under-parsing. To address these problems, previous HMER methods improve the attention mechanism by utilizing historical alignment information. However, this approach is limited in addressing under-parsing, since it cannot correct erroneous attention on image areas that should be parsed at subsequent decoding steps. Such faulty attention causes the attention module to incorporate future context into the current decoding step, thereby confusing the alignment process. To address this issue, we propose an attention guidance mechanism that explicitly suppresses attention weights in irrelevant areas and enhances the appropriate ones, thereby inhibiting access to information outside the intended context. Depending on the type of attention guidance, we devise two complementary approaches to refine attention weights: self-guidance, which coordinates the attention of multiple heads, and neighbor-guidance, which integrates attention from adjacent time steps. Experiments show that our method outperforms existing state-of-the-art methods, achieving expression recognition rates of 60.75% / 61.81% / 63.30% on the CROHME 2014 / 2016 / 2019 datasets.
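The two refinement strategies can be illustrated with a minimal sketch. Note this is an assumed interpretation of the abstract, not the paper's exact formulation: `self_guidance` uses the mean attention map across heads as a guidance signal (regions most heads agree on are enhanced, stray single-head attention is suppressed), and `neighbor_guidance` blends the current step's attention with the previous step's. The function names and the blending coefficient `beta` are hypothetical.

```python
import numpy as np

def self_guidance(head_weights: np.ndarray) -> np.ndarray:
    """Refine per-head attention by cross-head agreement.

    head_weights: (H, N) attention of H heads over N image regions,
    each row summing to 1. Regions that most heads attend to are
    enhanced; attention a lone head places elsewhere is suppressed.
    """
    guide = head_weights.mean(axis=0, keepdims=True)      # (1, N) guidance map
    refined = head_weights * guide                         # suppress disagreement
    return refined / refined.sum(axis=-1, keepdims=True)   # renormalize per head

def neighbor_guidance(curr: np.ndarray, prev: np.ndarray,
                      beta: float = 0.5) -> np.ndarray:
    """Integrate attention from the adjacent (previous) time step.

    Blending with the previous step's alignment discourages jumping
    prematurely to regions meant for later decoding steps.
    beta is a hypothetical mixing coefficient in [0, 1].
    """
    return (1.0 - beta) * curr + beta * prev

# Toy example: two heads attending over three image regions.
heads = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
refined = self_guidance(heads)          # rows still sum to 1; region 0 enhanced

curr = np.array([0.1, 0.8, 0.1])
prev = np.array([0.6, 0.3, 0.1])
blended = neighbor_guidance(curr, prev)  # still a valid distribution
```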