The Transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both the visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered, so features and characters may be misaligned in difficult text (e.g., text with a rare shape). Constraints such as character position have therefore been introduced to alleviate this problem. Despite some success, visual and semantic clues are still modeled separately and are only loosely associated. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position embedding. MDCDP uses the position embedding to query both visual and semantic features through the cross-attention mechanism. The two kinds of clues are fused into the position branch, generating a content-aware embedding that captures character spacing and orientation variations, semantic affinities between characters, and clues tying the two kinds of information together. We refer to these collectively as the multi-domain character distance. We develop CDistNet, which stacks multiple MDCDPs to guide progressively more precise distance modeling. As a result, feature-character alignment is well established even when various recognition difficulties are present. We evaluate CDistNet on ten challenging public datasets and two series of augmented datasets that we created. The experiments demonstrate that CDistNet is highly competitive. It not only ranks in the top tier on standard benchmarks, but also outperforms recent popular methods by clear margins on real and augmented datasets that present severe text deformation, poor linguistic support, and rare character layouts. Code is available at https://github.com/simplify23/CDistNet.
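To make the querying step concrete, the following is a minimal NumPy sketch of a position embedding attending over visual and semantic features and absorbing both results. It is an illustration under stated assumptions, not the CDistNet implementation: the function names (`cross_attention`, `mdcdp_like_block`), the single-head attention without learned projections, the fusion by plain residual addition, and all tensor sizes are hypothetical choices made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, memory):
    """Single-head scaled dot-product attention: `query` attends over `memory`."""
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)   # (num_queries, num_memory)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ memory, weights


def mdcdp_like_block(pos, visual, semantic):
    """Hypothetical block: the position embedding queries both domains.

    Fusion by residual addition is an assumption of this sketch; the
    abstract only says the two clues are fused into the position branch.
    """
    vis_ctx, vis_w = cross_attention(pos, visual)
    sem_ctx, sem_w = cross_attention(pos, semantic)
    fused = pos + vis_ctx + sem_ctx
    return fused, vis_w, sem_w


rng = np.random.default_rng(0)
T, N, d = 5, 12, 8                      # characters, visual tokens, feature size (all assumed)
pos = rng.normal(size=(T, d))           # position embedding, one row per character slot
visual = rng.normal(size=(N, d))        # visual features from an image encoder
semantic = rng.normal(size=(T, d))      # semantic features from character embeddings

# Stack several blocks so the embedding is refined step by step.
out = pos
for _ in range(3):
    out, vis_w, sem_w = mdcdp_like_block(out, visual, semantic)
```

Each row of `vis_w` shows which visual tokens a character slot attends to, which is the kind of feature-character alignment the abstract describes. A real model would add learned projections, multiple heads, normalization, and masking of future characters in the semantic branch.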