Keyword spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections. Today, the most efficient KWS methods rely on machine learning techniques that require large amounts of annotated training data. For historical manuscripts, however, annotated corpora for training are scarce. To address this data scarcity, we investigate the merits of self-supervised learning, which extracts useful representations of the input data without relying on human annotations and then uses these representations in the downstream task. We propose ST-KeyS, a masked auto-encoder model based on vision transformers, whose pretraining stage follows the mask-and-predict paradigm and requires no labeled data. In the fine-tuning stage, the pretrained encoder is integrated into a Siamese neural network that is fine-tuned to improve the feature embeddings of the input images. We further improve the image representation using the pyramidal histogram of characters (PHOC) embedding to create and exploit an intermediate representation of images based on text attributes. In an exhaustive experimental evaluation on three widely used benchmark datasets (Botany, Alvermann Konzilsprotokolle, and George Washington), the proposed approach outperforms state-of-the-art methods trained on the same datasets.
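To make the PHOC representation mentioned above more concrete, the following is a minimal sketch of how a binary PHOC vector can be computed for a transcription. It assumes unigrams only, a lowercase alphanumeric alphabet, and pyramid levels 1 to 3; the abstract states none of these settings, and the names `phoc`, `ALPHABET`, and `LEVELS` are hypothetical. It illustrates the general PHOC idea and is not necessarily the exact configuration used in ST-KeyS.

```python
import string

# Assumed settings (not given in the abstract).
ALPHABET = string.ascii_lowercase + string.digits
LEVELS = (1, 2, 3)


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Return a binary PHOC vector (list of 0/1) for `word`, unigrams only."""
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            # Work on an integer grid of n * level units to avoid float
            # rounding: character i spans [i*level, (i+1)*level) and the
            # region spans [region*n, (region+1)*n).
            r_start, r_end = region * n, (region + 1) * n
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue  # characters outside the alphabet are ignored
                c_start, c_end = i * level, (i + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                # A character belongs to a region if at least half of its
                # extent (which is `level` units wide) lies inside it.
                if 2 * overlap >= level:
                    bits[alphabet.index(ch)] = 1
            vector.extend(bits)
    return vector
```

The vector length is `len(alphabet) * sum(levels)`, so two words of different lengths map to vectors of the same size. This fixed size is what allows a word image embedding and a text query to be compared in a common attribute space.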