This paper introduces a novel task: extracting low-resource, noisy Latin fragments from mixed-language historical documents with varied layouts. We benchmark large foundation models on a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection is achievable with contemporary zero-shot models, yet these models lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin.