Language identification is the task of recognizing the language of written text in documents. This information is crucial because it supports the analysis of a document's vocabulary and context. Supervised learning methods have advanced language identification in recent years. However, these methods usually require large labeled datasets, which are often unavailable for particular image domains such as documents or scene images. In this work, we propose DocLangID, a transfer learning approach for identifying the language of unlabeled historical documents. First, we leverage labeled data from a different but related domain of historical documents. Second, we implement a distance-based few-shot learning approach to adapt a convolutional neural network to the new languages of the unlabeled dataset. By introducing a small number of manually labeled examples from the set of unlabeled images, our feature extractor adapts better to new and different data distributions of historical documents. We show that such a model can be effectively fine-tuned for the unlabeled set of images by reusing only the same few-shot examples. We evaluate our approach on 10 languages that mostly use the Latin script. Our experiments on historical documents demonstrate that the combined approach improves language identification performance, achieving 74% recognition accuracy on the four unseen languages of the unlabeled dataset.
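To illustrate what a distance-based few-shot classifier can look like, the following minimal sketch assigns a query document to the language whose class prototype (the mean of its few-shot embeddings) is nearest in Euclidean distance. This is a generic nearest-prototype scheme and an assumption on our part, not necessarily the exact formulation used in DocLangID. The function names, the language labels `"de"` and `"fr"`, and the 2-D embeddings are all hypothetical; in practice the embeddings would come from the convolutional feature extractor, which is not shown.

```python
import math
from collections import defaultdict


def class_prototypes(support):
    """Average the embeddings of each language's few-shot examples.

    `support` is a list of (embedding, label) pairs. In a real system the
    embeddings would be produced by a CNN feature extractor (assumed, not shown).
    """
    grouped = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    # Per-dimension mean over all embeddings that share a label.
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict_language(query, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy 2-D embeddings and labels, invented purely for illustration.
support_set = [
    ([0.9, 0.1], "de"),
    ([1.1, -0.1], "de"),
    ([-1.0, 0.2], "fr"),
    ([-0.8, 0.0], "fr"),
]
prototypes = class_prototypes(support_set)
prediction = predict_language([0.7, 0.05], prototypes)
```

Because the prototypes are computed from only the few labeled examples, a new language can be added by supplying a handful of embeddings for it, without changing the classifier itself.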