Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants

Virginia's seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, which limits spatial analysis. This study systematically evaluates how well current-generation large language models (LLMs) convert these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695–1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate prediction and tool-augmented chain-of-thought reasoning that invokes external geocoding APIs. Results were compared against a GIS analyst baseline, the Stanford NER geoparser, the Mordecai-3 neural geoparser, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and the external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced the mean error to 19.2 km (median 12.2 km) at minimal additional cost (~USD 0.20 per grant), outperforming the median LLM by 48.7%. A patentee-name redaction ablation increased error only slightly (~7%), indicating that the models rely on textual landmark and adjacency descriptions rather than memorization. The cost-effective gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark. External geocoding tools offered no measurable benefit in this evaluation. These findings demonstrate the potential of LLMs for scalable, accurate, and cost-effective historical georeferencing.
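
The error figures above can be illustrated with a minimal Python sketch. The abstract does not state which distance formula was used, so the haversine great-circle distance, the Earth radius constant, and the function names `haversine_km`, `summarize_errors`, and `relative_improvement` are all assumptions made for illustration, not the study's actual evaluation code.

```python
import math
from statistics import mean, median

# Mean Earth radius in km; an assumption of this sketch, not taken from the study.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def summarize_errors(predicted, truth):
    """Return (mean, median) error in km over paired (lat, lon) tuples."""
    errors = [haversine_km(p[0], p[1], t[0], t[1])
              for p, t in zip(predicted, truth)]
    return mean(errors), median(errors)


def relative_improvement(baseline_km, model_km):
    """Percentage reduction in error relative to a baseline error."""
    return 100.0 * (baseline_km - model_km) / baseline_km


# Five-call ensemble (19.2 km) against the median LLM (37.4 km).
print(round(relative_improvement(37.4, 19.2), 1))  # 48.7
```

The last line reproduces the 48.7% figure from the two mean errors quoted for the ensemble and the median LLM. The single-call percentages appear to be computed from unrounded errors, so recomputing them from the rounded 23 km value gives slightly different numbers.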
