When deployed in noisy real-world environments, an autonomous scene text spotting system must be able to generalize across multiple domains. However, existing state-of-the-art methods rely on pretraining and fine-tuning on natural scene datasets, and so do not exploit feature interactions across other complex domains. In this work, we investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data so that it generalizes directly to target domains rather than being specialized for a specific domain or scenario. To this end, we present a text spotting validation benchmark for noisy underwater scenes, called Under-Water Text (UWT), to establish an important case study. We also design an efficient super-resolution-based end-to-end transformer baseline, called DA-TextSpotter, which achieves comparable or superior performance to existing text spotting architectures on both regular and arbitrary-shaped scene text spotting benchmarks, in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance.