In recent years, the optical character recognition (OCR) field has seen a proliferation of cutting-edge approaches for a wide spectrum of tasks. However, these approaches are designed for specific tasks, with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders fast deployment in applications. To this end, we propose UPOCR, a simple yet effective generalist model that serves as a Unified Pixel-level OCR interface. Specifically, UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, model training uniformly aims to minimize the discrepancy between the generated and ground-truth images, regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks: text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results show that the proposed method simultaneously achieves state-of-the-art performance on all three tasks with a single unified model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.
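To make the unified interface concrete, the following toy sketch illustrates the idea of a shared encoder and decoder conditioned on a learnable per-task prompt, with a single pixel-level objective for every task. It is not the UPOCR implementation: the class `ToyUnifiedModel`, the helper names, the plain linear maps standing in for the ViT encoder-decoder, the additive prompt injection, and the L1 loss are all assumptions chosen for illustration, and no training loop is shown.

```python
import random

# The three pixel-level tasks named in the abstract.
TASKS = ("text_removal", "text_segmentation", "tampered_text_detection")


def linear(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def init_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyUnifiedModel:
    """Hypothetical toy model: shared encoder/decoder, one prompt per task."""

    def __init__(self, patch_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.enc = init_matrix(patch_dim, hidden_dim, rng)   # shared encoder
        self.dec = init_matrix(hidden_dim, patch_dim, rng)   # shared decoder
        # One prompt vector per task (would be learned during training).
        self.prompts = {
            task: [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
            for task in TASKS
        }

    def forward(self, patches, task):
        """Map input image patches to output image patches for one task."""
        prompt = self.prompts[task]
        outputs = []
        for patch in patches:
            feat = linear(patch, self.enc)                # general features
            # Assumption: the prompt is added to the encoder features.
            feat = [f + p for f, p in zip(feat, prompt)]  # task-specific shift
            outputs.append(linear(feat, self.dec))        # decode to pixels
        return outputs


def l1_loss(pred, target):
    """Mean absolute pixel difference (assumed form of the discrepancy)."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


# The same model and the same loss serve every task; only the prompt changes.
model = ToyUnifiedModel()
patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
for task in TASKS:
    generated = model.forward(patches, task)
    print(task, l1_loss(generated, patches))
```

In this sketch the only task-dependent component is the prompt lookup, which mirrors the abstract's claim that paradigm, architecture, and training objective are shared across tasks.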