TopoBERT: Plug and Play Toponym Recognition Module Harnessing Fine-tuned BERT

Extracting precise geographical information from text is crucial in many applications. For example, during hazardous events, a robust and unbiased toponym extraction framework can tie locations to the topics discussed in news media posts and pinpoint humanitarian help requests or damage reports from social media. Previous studies have used rule-based, gazetteer-based, deep learning, and hybrid approaches to address this problem. However, existing tools do not perform well enough to support operations such as emergency rescue, which rely on fine-grained, accurate geographic information. Emerging pretrained language models can better capture the underlying characteristics of text, including place names, offering a promising pathway to improve toponym recognition for practical applications. In this paper, TopoBERT, a toponym recognition module based on a one-dimensional Convolutional Neural Network (CNN1D) and Bidirectional Encoder Representations from Transformers (BERT), is proposed and fine-tuned. Three datasets (CoNLL2003-Train, Wikipedia3000, WNUT2017) are used to tune the hyperparameters, identify the best training strategy, and train the model. Two further datasets (CoNLL2003-Test and Harvey2017) are used to evaluate performance. Three classifiers (linear, multi-layer perceptron, and CNN1D) are benchmarked to determine the optimal model architecture. TopoBERT achieves state-of-the-art performance (F1-score = 0.865) compared with five baseline models and can be applied to diverse toponym recognition tasks without additional training.
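
To make the reported metric concrete, the following is a minimal sketch of how an F1-score for toponym recognition could be computed. It is not the evaluation code behind the reported result. It assumes span-level exact matching and a BIO tagging scheme with the labels `B-LOC`, `I-LOC`, and `O`; the abstract specifies neither the matching criterion nor the label set. The function names `extract_spans` and `span_f1` and the example sentence are hypothetical and used only for illustration.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect toponym spans from BIO tags (assumed labels: B-LOC, I-LOC, O)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B-LOC":
            if start is not None:      # close the span that was open
                spans.add((start, i))
            start = i                  # open a new span
        elif tag == "I-LOC":
            if start is None:          # stray I-LOC: treat it as a span start
                start = i
        else:                          # "O" (or any other label) closes a span
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:              # span running to the end of the sentence
        spans.add((start, len(tags)))
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match span-level F1 between gold and predicted tag sequences."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: "Flooding near Buffalo Bayou in Houston"
gold_tags = ["O", "O", "B-LOC", "I-LOC", "O", "B-LOC"]
pred_tags = ["O", "O", "B-LOC", "O", "O", "B-LOC"]  # "Bayou" is missed
score = span_f1(gold_tags, pred_tags)
```

In this example the prediction recovers "Houston" exactly but truncates "Buffalo Bayou", so one of two predicted spans and one of two gold spans match, giving precision 0.5, recall 0.5, and F1 0.5. A token-level variant would score the same prediction more generously, which is why the matching criterion matters when comparing against baselines.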
