Natural scene text detection is a significant challenge in computer vision, with broad potential applications in multilingual, diverse, and complex text scenarios. To address the low accuracy and high difficulty of detecting multilingual text in natural scenes, we propose a multilingual text detection model. Because multilingual text images contain multiple character sets and varied font styles, we introduce the SFM Swin Transformer feature extraction network to make the model more robust to characters and fonts across languages. To handle the large variation in text scale and the complex arrangements found in natural scene text images, we present the AS-HRFPN feature fusion network, which incorporates an Adaptive Spatial Feature Fusion module and a Spatial Pyramid Pooling module; these improvements strengthen the model's ability to detect text of different sizes and orientations. Existing methods struggle with the diverse backgrounds and font variations of multilingual scene text images because their limited local receptive fields hinder detection performance. To overcome this, we propose a Global Semantic Segmentation Branch that extracts and preserves global features, supplying the broader context needed for more effective text detection. We collected and built a real-world multilingual natural scene text image dataset and conducted comprehensive experiments and analyses. The experimental results show that the proposed algorithm achieves an F-measure of 85.02%, which is 4.71% higher than the baseline model. We also conducted extensive cross-dataset validation on the MSRA-TD500, ICDAR2017MLT, and ICDAR2015 datasets to verify the generality of our approach. The code and dataset are available at https://github.com/wangmelon/CEMLT.
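For readers unfamiliar with the reported metric, the F-measure in text detection is conventionally the harmonic mean of precision and recall. The sketch below illustrates that standard definition only; it is not the evaluation code from the CEMLT repository. The function names and the example counts are hypothetical, and the rule that decides when a detection counts as a match (for example an IoU threshold) is assumed to have been applied beforehand.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(true_positives: int, num_detections: int, num_ground_truth: int):
    """Return (precision, recall, F-measure) from matched-box counts.

    Assumes the matching between detected and ground-truth text regions
    has already been computed by whatever protocol the benchmark uses.
    """
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Hypothetical counts for illustration only; these are not results from the paper.
    p, r, f = detection_scores(true_positives=80, num_detections=100, num_ground_truth=100)
    print(f"precision={p:.4f} recall={r:.4f} F-measure={f:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot reach a high F-measure by maximizing only precision or only recall.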