Vision models have gained increasing attention due to their simplicity and efficiency in Scene Text Recognition (STR) tasks. However, because they lack the perception of linguistic knowledge and information, recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized in this paper as the linguistic insensitive drift (LID) problem; (2) the visual feature is suboptimal for recognition in some vision-missing cases (e.g., occlusion). To address these issues, we propose a $\textbf{L}$inguistic $\textbf{P}$erception $\textbf{V}$ision model (LPV), which explores the linguistic capability of the vision model for accurate text recognition. To alleviate the LID problem, we introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality and accurate attention maps through step-wise optimization and linguistic information mining. Furthermore, a Global Linguistic Reconstruction Module (GLRM) is proposed to improve the representation of visual features by perceiving the linguistic information in the visual space, which gradually converts visual features into semantically rich ones during the cascade process. Unlike previous methods, our method obtains state-of-the-art (SOTA) results while keeping low complexity (92.4% accuracy with only 8.11M parameters). Code is available at $\href{https://github.com/CyrilSterling/LPV}{https://github.com/CyrilSterling/LPV}$.
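The abstract names the Cascade Position Attention mechanism but does not specify its layers, so the following is only a minimal NumPy sketch of one way a cascade of position attention could be arranged. It assumes scaled dot-product attention between per-character positional queries and flattened visual features, and it assumes that each stage refines the next stage's queries by adding the previous stage's output. The function names, the additive refinement rule, and the tensor sizes are all hypothetical and are not taken from the LPV code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def position_attention(queries, features):
    """Scaled dot-product attention (assumed form, not the paper's exact layer).

    queries:  (T, D) one query per character position
    features: (N, D) flattened visual feature map
    returns:  glimpses (T, D) and attention maps (T, N)
    """
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    return attn @ features, attn


def cascade_position_attention(pos_queries, stage_features):
    """Hypothetical cascade: each stage's glimpses refine the next queries."""
    queries = pos_queries
    attn_maps = []
    for feats in stage_features:
        glimpses, attn = position_attention(queries, feats)
        attn_maps.append(attn)
        # Assumption: content from the previous stage is added to the pure
        # positional queries, so later queries are no longer vision-agnostic.
        queries = pos_queries + glimpses
    return glimpses, attn_maps


# Illustrative sizes only: 25 character slots, an 8x32 feature map, 64 channels.
rng = np.random.default_rng(0)
T, N, D = 25, 8 * 32, 64
pos_queries = rng.standard_normal((T, D))
stage_features = [rng.standard_normal((N, D)) for _ in range(3)]
glimpses, attn_maps = cascade_position_attention(pos_queries, stage_features)
```

In this sketch the first stage uses purely positional queries, which is the setting the abstract associates with attention drift, while later stages see queries that already carry content from earlier predictions.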