End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods rely heavily on Region-of-Interest (RoI) operations to extract local features and on complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter built on the Transformer architecture. Specifically, using a query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. This design allows the classification, segmentation, and recognition branches to be trained and optimized jointly, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an Adaptive Global aGgregation (AGG) module that transfers global features into sequential features for reading arbitrarily shaped text, which overcomes the sub-optimization problem of RoI operations. Furthermore, mixed supervision exploits potential corpus information from annotations ranging from weak annotations to full labels, further improving text detection and end-to-end text spotting results. Extensive experiments on various bilingual (i.e., English and Chinese) benchmarks demonstrate the superiority of our method. In particular, on the TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method by 13.2% in terms of 1-NED.
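For readers unfamiliar with the 1-NED metric cited above, the following is a minimal sketch of how such a score is commonly computed: one minus the edit distance between a predicted and a reference string, normalized by the longer of the two lengths and averaged over all pairs. The function names `edit_distance` and `one_minus_ned` are hypothetical, and the sketch assumes predictions have already been paired with their ground-truth transcriptions; the detection-to-ground-truth matching step of a full end-to-end evaluation protocol is not shown and may differ from what the benchmark actually uses.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_minus_ned(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Average of (1 - normalized edit distance) over already-paired strings.

    Assumption: each prediction is already matched to its reference; an
    empty-vs-empty pair counts as a perfect match.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one pair is required")
    total = 0.0
    for pred, ref in zip(predictions, references):
        longest = max(len(pred), len(ref))
        ned = edit_distance(pred, ref) / longest if longest else 0.0
        total += 1.0 - ned
    return total / len(predictions)
```

Under this definition, a prediction that differs from a four-character reference in a single character contributes 0.75 to the average, so the metric gives partial credit for near-correct transcriptions, which matters for long Chinese text lines.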