Scene Text Recognition (STR) faces the challenge of extracting effective character representations from visual data when the text is unreadable. Permutation language modeling (PLM) has been introduced to refine character predictions by jointly capturing contextual and visual information. However, the random permutations used in PLM cause training-fit oscillation, and the iterative refinement (IR) operation introduces additional inference overhead. To address these issues, this paper proposes the Hierarchical Attention Autoregressive Model with Adaptive Permutation (HAAP), which strengthens position-context-image interaction and improves the generalization of autoregressive language modeling. First, we propose Implicit Permutation Neurons (IPN) that generate adaptive attention masks to dynamically exploit token dependencies, strengthening the correlation between visual information and context; this adaptive correlation representation helps the model avoid training-fit oscillation. Second, we introduce a Cross-modal Hierarchical Attention mechanism (CHA) to capture the dependencies among position queries, contextual semantics, and visual information. CHA enables position tokens to aggregate global semantic information, removing the need for IR. Extensive experiments show that the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
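To make the contrast concrete, the sketch below illustrates the two masking ideas the abstract names: a fixed mask derived from a randomly sampled permutation (as in PLM) versus a learned, soft "adaptive" mask produced by a small pairwise scoring network (in the spirit of IPN). This is a minimal, hypothetical illustration, not the paper's implementation; the scoring matrix `W` and the sigmoid parameterization are assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_permutation_mask(T):
    """PLM-style mask: token i may attend only to tokens that precede
    it in a randomly sampled factorization order (hard 0/1 mask)."""
    order = rng.permutation(T)
    rank = np.empty(T, dtype=int)
    rank[order] = np.arange(T)
    # mask[i, j] = 1 if token j comes before token i in the sampled order
    return (rank[None, :] < rank[:, None]).astype(float)

def adaptive_mask(tokens, W):
    """IPN-style idea (hypothetical sketch): a tiny scoring network maps
    pairwise token features to soft mask values in (0, 1), so the
    dependency pattern is learned instead of randomly permuted."""
    scores = tokens @ W @ tokens.T            # pairwise compatibility
    return 1.0 / (1.0 + np.exp(-scores))      # sigmoid -> soft mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with the mask applied in log space."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(mask + 1e-9)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, d = 5, 8
x = rng.normal(size=(T, d))          # toy token embeddings
W = rng.normal(size=(d, d)) * 0.1    # toy scoring parameters
out = masked_attention(x, x, x, adaptive_mask(x, W))
print(out.shape)  # (5, 8)
```

The key difference: the permutation mask is resampled at random each step (the source of the training-fit oscillation the abstract describes), while the adaptive mask is a deterministic, differentiable function of the tokens themselves.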