Scene text recognition is a rapidly developing field that remains challenging because scene text is complex and diverse: backgrounds are complex, fonts vary widely, text arrangements are flexible, and characters may be accidentally occluded. In this paper, we propose Class-Aware Mask-guided feature refinement (CAM) to address these challenges. CAM introduces canonical class-aware glyph masks, generated from a standard font, to suppress background and text-style noise and thereby enhance feature discrimination. In addition, we design a feature alignment and fusion module that incorporates the canonical mask guidance to further refine features for text recognition. By improving the alignment between the canonical mask feature and the text feature, the module enables more effective fusion and, ultimately, better recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, on six more challenging datasets, CAM outperforms the state-of-the-art method by an average of 4.1% while using a smaller model. Our study highlights the importance of canonical mask guidance and aligned feature refinement for robust scene text recognition. The code is available at https://github.com/MelosY/CAM.
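To make the notion of a canonical class-aware glyph mask concrete, the following minimal sketch builds one for a short string. It is an illustration only and not the implementation in the CAM repository: the 3x5 bitmap glyphs in `GLYPHS`, the class numbering in `CLASS_ID`, and the function `canonical_class_mask` are all hypothetical stand-ins, and a real pipeline would presumably rasterize glyphs from an actual standard font file. The point it shows is that every foreground pixel carries the class index of its character, while background pixels are 0 and no background or style information from the scene image is present.

```python
# Toy 3x5 bitmap glyphs standing in for a "standard font" (hypothetical;
# a real pipeline would rasterize glyphs from an actual font file).
GLYPHS = {
    "A": ["010", "101", "111", "101", "101"],
    "C": ["111", "100", "100", "100", "111"],
    "M": ["101", "111", "101", "101", "101"],
}
GLYPH_HEIGHT = 5

# Class index 0 is reserved for background; characters start at 1.
CHARSET = sorted(GLYPHS)
CLASS_ID = {ch: i + 1 for i, ch in enumerate(CHARSET)}


def canonical_class_mask(text, gap=1):
    """Return a 2D list of ints: 0 = background, k = pixel of character class k."""
    rows = [[] for _ in range(GLYPH_HEIGHT)]
    for pos, ch in enumerate(text):
        glyph = GLYPHS[ch]
        cls = CLASS_ID[ch]
        for y in range(GLYPH_HEIGHT):
            if pos > 0:
                # Blank columns between neighbouring glyphs.
                rows[y].extend([0] * gap)
            # Foreground pixels take the character's class index.
            rows[y].extend(cls if px == "1" else 0 for px in glyph[y])
    return rows


mask = canonical_class_mask("CAM")
for row in mask:
    # Print background as '.' and foreground as its class index.
    print("".join(str(v) if v else "." for v in row))
```

In this toy setting the mask depends only on the character sequence, which is what makes it "canonical": two scene images of the same word in different fonts and backgrounds would map to the same target mask.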