The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps

Documents often have a spatial focus and carry valuable locality characteristics. For example, real estate listings and travel blogs describe specific local neighborhoods. This information helps characterize how people perceive their environment. However, the first step in using it is to identify the spatial focus (e.g., a city) of a document. Traditional approaches identify the spatial focus by detecting and disambiguating toponyms in the document. These approaches require a vocabulary of location phrases and ad-hoc rules, and they ignore other important words related to location. Recent topic modeling approaches using large language models typically consider only a few topics, each with broad coverage. In contrast, the spatial focus of a document can be a country, a city, or even a neighborhood, so the number of candidate locations is much larger than the number of topics these approaches consider. Additionally, topic modeling methods are often applied to the broad topics of news articles, where context is easily distinguishable. To identify the geographic focus of a document, we present Joint Embedding of multi-LocaLitY (JELLY), a simple but effective approach that jointly learns representations using separate encoders for documents and locations. JELLY significantly outperforms state-of-the-art methods for identifying the spatial focus of documents from a number of sources. We also present case studies on the arithmetic of the learned representations, including identifying cities with similar locality characteristics and zero-shot learning to identify document spatial focus.

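The idea of scoring a document against candidate locations in a shared embedding space can be illustrated with a minimal sketch. This is not the JELLY implementation: the abstract does not specify the encoders, the similarity function, or the training objective, so the cosine similarity, the softmax-style contrastive loss, the temperature value, the function names, and the toy three-dimensional vectors below are all assumptions chosen for illustration. The vectors stand in for the outputs of a document encoder and a location encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_locations(doc_vec, location_vecs):
    """Return (location, score) pairs sorted by descending similarity."""
    scores = {name: cosine(doc_vec, vec) for name, vec in location_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def contrastive_loss(doc_vec, location_vecs, true_location, temperature=0.1):
    """Softmax cross-entropy over similarities (an assumed objective).

    The loss is small when the true location is the most similar one.
    """
    logits = {
        name: cosine(doc_vec, vec) / temperature
        for name, vec in location_vecs.items()
    }
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits.values())
    log_norm = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits.values())
    )
    return log_norm - logits[true_location]


# Hypothetical, hand-made embeddings; real ones would come from trained encoders.
LOCATIONS = {
    "Los Angeles": [0.9, 0.1, 0.2],
    "Minneapolis": [0.1, 0.9, 0.3],
    "Seattle": [0.2, 0.4, 0.9],
}
doc = [0.8, 0.2, 0.1]  # stand-in for the document encoder's output

ranking = rank_locations(doc, LOCATIONS)
print(ranking[0][0])
```

In this toy setup the predicted spatial focus is simply the top-ranked location. Because locations are represented as vectors rather than as a fixed label set, a location unseen during training could in principle be scored the same way, which is one plausible reading of the zero-shot case study mentioned in the abstract.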