The ability to adapt to a wide range of domains is crucial for scene text spotting models deployed in real-world conditions. However, existing state-of-the-art (SOTA) approaches usually combine scene text detection and recognition simply by pretraining on natural scene text datasets, an approach that does not directly exploit the intermediate feature representations across multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data so that it can adapt directly to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR for spotting both regular and arbitrary-shaped scene text, and evaluate it exhaustively. The results demonstrate the potential of intermediate representations to achieve strong performance on text spotting benchmarks across multiple domains (e.g., language, synth-to-real, and documents), both in terms of accuracy and efficiency.