Large-scale and category-balanced text data is essential for training effective Scene Text Recognition (STR) models, yet such data is hard to obtain when collecting real images. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind that of real data, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitation: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulation of challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real-data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over those trained on existing synthetic datasets, and even surpass real-data performance in certain scenarios. Moreover, with SEL, the trained models achieve competitive performance while seeing only 9% of real data labels.