Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR), which jointly learns a vision model and a language model, extends poorly to low-resource document collections, because learning a joint language-vision model requires extensive labeled sequences and compute. This study models OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. Because the model learns only characters' visual features, it is more sample-efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.
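The retrieval framing described above can be illustrated with a minimal sketch. It assumes that a vision encoder has already mapped each character crop to an embedding vector and that a labeled set of reference embeddings is available; the function names, the toy three-dimensional vectors, and the use of cosine similarity are illustrative assumptions, not details taken from the study.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_characters(crop_embeddings, index_embeddings, index_labels):
    """Label each character crop with its nearest reference embedding.

    crop_embeddings:  (n_crops, dim) array, one row per segmented character.
    index_embeddings: (n_refs, dim) array of labeled reference embeddings.
    index_labels:     list of n_refs characters, aligned with index_embeddings.
    """
    queries = l2_normalize(np.asarray(crop_embeddings, dtype=float))
    index = l2_normalize(np.asarray(index_embeddings, dtype=float))
    similarity = queries @ index.T        # (n_crops, n_refs) cosine similarities
    nearest = similarity.argmax(axis=1)   # best-matching reference per crop
    return "".join(index_labels[i] for i in nearest)


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
index_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 1.0]])
index_labels = ["a", "b", "c"]
crop_embeddings = np.array([[0.1, 0.9, 0.0],
                            [0.8, 0.1, 0.1],
                            [0.0, 0.2, 0.7]])

recognized = retrieve_characters(crop_embeddings, index_embeddings, index_labels)
print(recognized)  # bac
```

Under this framing, supporting an additional character could in principle amount to adding a labeled row to the reference index, which is one way to read the extensibility claim; whether the study's system works exactly this way is an assumption here.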