EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates

This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopédie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented with applying these models to new texts. From the 74,000 total articles in each of the digitized versions of the Encyclopédie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 that contain coordinates and 10,480 that contain descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier that identifies coordinate-bearing entries with a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an exact match (EM) score of 86%. On the out-of-domain eighteenth-century Trévoux dictionary (also in French), our fine-tuned model reached an EM score of 61%, while on the nineteenth-century seventh edition of the Encyclopaedia Britannica (in English), the EM score was 77%. These findings highlight the gold standard dataset's usefulness as training data and the cross-lingual, cross-domain generalizability of our two-step method.

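The two-step design and the EM metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`two_step_retrieve`, `dms_to_decimal`, `exact_match`), the callables standing in for the classifier and the retrieval model, and the choice of decimal degrees as a normalized form are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Callable, Optional, Sequence


def dms_to_decimal(degrees: int, minutes: int = 0, seconds: int = 0) -> float:
    """Convert degrees/minutes/seconds to decimal degrees.

    Assumption: decimal degrees is used here only as one possible
    normalized form; the abstract does not state the target format.
    """
    return degrees + minutes / 60 + seconds / 3600


def two_step_retrieve(
    entry: str,
    has_coordinates: Callable[[str], bool],  # hypothetical stand-in for the classifier
    extract: Callable[[str], str],           # hypothetical stand-in for the retrieval model
) -> Optional[str]:
    """Run the retrieval step only on entries the classifier flags."""
    if not has_coordinates(entry):
        return None
    return extract(entry)


def exact_match(
    predictions: Sequence[Optional[str]],
    references: Sequence[Optional[str]],
) -> float:
    """Fraction of predictions that are identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this definition, a prediction counts only if it equals the reference exactly, so a coordinate that is off by a single minute scores the same as a missing one. That strictness is worth keeping in mind when reading the 86%, 61%, and 77% figures.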
