In this paper, we investigate self-supervised pre-training methods for document text recognition. Large unlabeled datasets can now be collected for many research tasks, including text recognition, but annotating them is costly, which motivates research into methods that exploit unlabeled data. We study self-supervised pre-training methods based on masked label prediction using three different approaches: Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique that prevents a form of model collapse in which the model relies solely on positional encoding and completely ignores the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first studies exploring self-supervised pre-training in document text recognition, and we believe that it will become a cornerstone for future research in this area. We have made our implementation of the investigated methods publicly available at https://github.com/DCGM/pero-pretraining.