Context-aware methods have achieved great success in supervised scene text recognition (STR) by incorporating semantic priors from words. We argue that this prior contextual information can be interpreted as relations among textual primitives, owing to the heterogeneity of text and background, and that these relations can provide effective self-supervised labels for representation learning. However, because of lexical dependencies, the available textual relations are limited by the finite size of the dataset, which causes over-fitting and compromises representation robustness. To this end, we propose to enrich the textual relations via rearrangement, hierarchy and interaction, and design a unified framework called RCLSTR: Relational Contrastive Learning for Scene Text Recognition. Based on causality, we theoretically explain that the three modules suppress the bias caused by the contextual prior and thus guarantee representation robustness. Experiments on representation quality show that our method outperforms state-of-the-art self-supervised STR methods. Code is available at https://github.com/ThunderVVV/RCLSTR.