This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopédie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented with applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopédie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 that contain coordinates and 10,480 that contain descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier that identifies coordinate-bearing entries with a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an exact match (EM) score of 86%. On an out-of-domain text in the same language, the eighteenth-century French Trévoux dictionary, our fine-tuned model reached an EM score of 61%, while on the nineteenth-century seventh edition of the Encyclopaedia Britannica, in English, the EM score was 77%. These findings highlight the gold standard dataset's usefulness as training data and the cross-lingual, cross-domain generalizability of our two-step method.
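To make the reported exact match (EM) figures concrete, the following is a minimal sketch of how such a score can be computed once coordinates have been normalized. It assumes predictions and gold annotations are compared as normalized strings; the function names `dms_to_decimal` and `exact_match`, and the string format used in the usage note, are hypothetical illustrations and are not taken from the paper.

```python
def dms_to_decimal(degrees: int, minutes: int = 0, hemisphere: str = "N") -> float:
    """Convert degrees and minutes to signed decimal degrees.

    Hypothetical normalization step: south and west are returned as negative values.
    """
    value = degrees + minutes / 60
    return -value if hemisphere in ("S", "W") else value


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions identical to their reference.

    Only surrounding whitespace is ignored; any other difference counts as a miss.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

For example, `dms_to_decimal(43, 30, "N")` gives `43.5`, and `exact_match(["43.5 2.25", "none"], ["43.5 2.25", "48.0 2.0"])` gives `0.5`, since one of the two predictions matches its reference exactly. Because EM gives no partial credit, a prediction that is off by a single minute of arc scores the same as a missing one, which makes it a strict measure for this task.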