High-quality text extraction is essential for text-based document analysis tasks such as Document Classification or Named Entity Recognition. Unfortunately, high quality is not always guaranteed, because poor scan quality and the resulting artifacts lead to errors in the Optical Character Recognition (OCR) process. Current approaches using Convolutional Neural Networks show promising results for background removal but fail to correct artifacts such as pixelation or compression errors. For general images, Transformer backbones are increasingly being integrated into well-known neural network structures for denoising tasks. In this work, a modified UNet structure using a Swin Transformer backbone is presented to remove typical artifacts in scanned documents. Multi-headed cross-attention skip connections are used to learn features more selectively at the respective levels of abstraction. The performance of this approach is examined with regard to compression errors, pixelation, and random noise. An improvement in text extraction quality, with the error rate reduced by up to 53.9% on the synthetic data, is achieved. The pretrained base model can be easily adapted to new artifacts. The cross-attention skip connections make it possible to integrate textual information, either extracted from the encoder or supplied in the form of commands, to control the model's output more selectively. The latter is shown by means of an example application.
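To make the reported error-rate reduction concrete, the following is a minimal sketch of how such a figure could be computed. It rests on two assumptions that the abstract does not confirm: that the error rate is a character error rate (CER) based on edit distance, and that the 53.9% refers to a relative reduction. The function names and the sample strings are hypothetical and used only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed metric: edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction by which the error rate dropped after restoration."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return (cer_before - cer_after) / cer_before


# Hypothetical OCR outputs for one ground-truth line (not data from the paper).
reference = "Invoice total: 1024 EUR"
ocr_degraded = "lnvoice tota1: 1O24 EUR"   # OCR on the artifact-laden scan
ocr_restored = "Invoice total: 1O24 EUR"   # OCR after artifact removal

cer_before = char_error_rate(reference, ocr_degraded)
cer_after = char_error_rate(reference, ocr_restored)
print(f"CER before: {cer_before:.3f}, after: {cer_after:.3f}, "
      f"relative reduction: {relative_reduction(cer_before, cer_after):.1%}")
```

Under this reading, a reduction of 53.9% would mean that the OCR output on restored images contains a little less than half as many character errors as the output on the degraded images.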