This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. We systematically compare three complementary strategies: training on noisy, clean, and artificial data. Our method is among the first to recognize multiple entity types in VET documents. It is applied to German documents but is transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
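To make the idea of synthetically injected OCR errors concrete, the following is a minimal sketch of how character-level noise could be added to token-labelled NER data while keeping labels aligned. It is not the released implementation: the confusion table `OCR_CONFUSIONS`, the functions `corrupt_token` and `inject_ocr_noise`, the `error_rate` parameter, and the example sentence with its `ORG` tags are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical table of visually similar characters (an assumption, not from the paper).
OCR_CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "I"],
    "m": ["rn"],
    "n": ["u"],
    "o": ["0"],
    "s": ["5"],
    "ä": ["a"],
    "ü": ["u", "ii"],
    "ß": ["B"],
}


def corrupt_token(token, error_rate, rng):
    """Corrupt single characters of a token with probability error_rate."""
    out = []
    for ch in token:
        if rng.random() >= error_rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(["substitute", "delete", "duplicate"])
        if op == "substitute":
            # Characters without a table entry are left as they are.
            out.append(rng.choice(OCR_CONFUSIONS.get(ch, [ch])))
        elif op == "duplicate":
            out.extend([ch, ch])
        # "delete": append nothing
    noisy = "".join(out)
    # Never return an empty token, so the label sequence stays aligned.
    return noisy if noisy else token


def inject_ocr_noise(tokens, labels, error_rate=0.1, seed=None):
    """Return a noisy copy of tokens together with the unchanged labels."""
    rng = random.Random(seed)
    noisy_tokens = [corrupt_token(t, error_rate, rng) for t in tokens]
    return noisy_tokens, list(labels)


if __name__ == "__main__":
    tokens = ["Die", "Handwerkskammer", "München", "prüft", "Lehrlinge"]
    labels = ["O", "B-ORG", "I-ORG", "O", "O"]
    noisy, noisy_labels = inject_ocr_noise(tokens, labels, error_rate=0.2, seed=0)
    for tok, lab in zip(noisy, noisy_labels):
        print(tok, lab)
```

Because noise is applied inside tokens and tokens are never dropped or merged, the entity labels of the clean sentence carry over unchanged, so clean and noisy versions of the same sentence can both be used during fine-tuning. Real OCR output also splits and merges words, which this sketch deliberately does not model.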