The ability to adapt to a wide range of domains is crucial for scene text spotting models deployed in real-world conditions. However, existing state-of-the-art (SOTA) approaches usually combine scene text detection and recognition simply by pretraining on natural scene text datasets, which does not directly exploit the intermediate feature representations shared across multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data such that it can directly adapt to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR, which addresses scene text spotting for both regular and arbitrary-shaped text, and we evaluate it exhaustively. The results clearly demonstrate the potential of intermediate representations to achieve strong performance on text spotting benchmarks across multiple domains (e.g., language, synth-to-real, and documents), both in terms of accuracy and efficiency.