Scene text recognition, a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information that refines the visual recognition result. However, these methods ignore the guidance of visual cues during semantic mining, which limits their performance on irregular scene text. To address this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding to both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch performs recognition based on the visual features extracted by a CNN and the position encoding provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, mimics the way humans recognize scene text and integrates cross-modal visual cues into text recognition. Experiments demonstrate that the proposed CMFN achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.
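The abstract does not specify how the cross-modal fusion gate is computed. As an illustration only, the sketch below shows one common form of gated fusion, in which a sigmoid gate computed from the concatenated visual and semantic features decides, per dimension, how much of each modality to keep. The function names `sigmoid` and `fusion_gate`, the parameters `weights` and `bias`, and the gating formula itself are assumptions for this sketch and are not taken from CMFN.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, mapping any real number into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fusion_gate(visual, semantic, weights, bias):
    """Fuse one character's visual and semantic feature vectors (assumed form).

    visual, semantic: lists of d floats for the same character position.
    weights: d rows of 2*d floats (hypothetical learned parameters).
    bias: list of d floats (hypothetical learned parameters).
    Returns a list of d fused floats.
    """
    concat = visual + semantic  # concatenation [v; s], length 2*d
    fused = []
    for i, (v, s) in enumerate(zip(visual, semantic)):
        # Gate g in [0, 1]: the share of the visual feature kept in dimension i.
        g = sigmoid(sum(w * x for w, x in zip(weights[i], concat)) + bias[i])
        fused.append(g * v + (1.0 - g) * s)
    return fused


# Usage: with all-zero parameters the gate is 0.5, so the output is the plain average.
d = 2
visual = [1.0, 2.0]
semantic = [3.0, 6.0]
zero_weights = [[0.0] * (2 * d) for _ in range(d)]
print(fusion_gate(visual, semantic, zero_weights, [0.0] * d))  # [2.0, 4.0]
```

In this assumed form, a gate value near 1 lets the visual feature dominate and a value near 0 lets the semantic feature dominate, which is one way visual cues could steer the semantic branch at each iteration.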