Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques

The remarkable success of GPT models across various tasks, including toponym recognition, motivates us to assess the performance of the GPT-3 model in the geocoding address parsing task. To ensure that the evaluation more accurately mirrors real-world performance, where the quality of user input varies widely, and to address the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized from human input patterns mined from the actual input logs of a geocoding system in production. This dataset covers 21 different input errors and variations; contains over 239,000 address records uniquely selected from streets across all 50 U.S. states and D.C.; and consists of three subsets to be used as training, validation, and testing sets. Building on this, we train and gauge the performance of the GPT-3 model in extracting address components, contrasting it with transformer-based and LSTM-based models. The evaluation results indicate that the bidirectional LSTM-CRF model achieves the best performance, ahead of the transformer-based models and the GPT-3 model. The transformer-based models achieve results comparable to those of the bidirectional LSTM-CRF model. The GPT-3 model, though trailing in performance, shows potential in the address parsing task with few-shot examples and leaves room for improvement with additional fine-tuning. We open-source the code and data of this benchmark so that researchers can use it for future model development or extend it to evaluate similar tasks, such as document geocoding.
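To make the idea of a synthesized low-quality address concrete, the following is a minimal Python sketch, not the benchmark's actual generation code. It treats an address as a list of labeled tokens and applies three illustrative perturbations while keeping the component labels aligned. The label names, the example address, and the three perturbation functions are all assumptions chosen for illustration; the 21 error types in the dataset are not enumerated here and may differ.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token text, address component label)

# Hypothetical label set and example address; the benchmark's schema may differ.
CLEAN: List[Token] = [
    ("123", "HouseNumber"),
    ("North", "PreDirection"),
    ("Main", "StreetName"),
    ("Street", "StreetSuffix"),
    ("Springfield", "City"),
    ("IL", "State"),
    ("62701", "ZipCode"),
]

# Assumed abbreviation table, for illustration only.
SUFFIX_ABBREVIATIONS = {"Street": "St", "Avenue": "Ave", "Boulevard": "Blvd"}


def abbreviate_suffix(tokens: List[Token]) -> List[Token]:
    """Replace a spelled-out street suffix with its abbreviation."""
    return [
        (SUFFIX_ABBREVIATIONS.get(text, text) if label == "StreetSuffix" else text, label)
        for text, label in tokens
    ]


def drop_component(tokens: List[Token], label: str) -> List[Token]:
    """Simulate a user omitting one component, such as the ZIP code."""
    return [(text, lab) for text, lab in tokens if lab != label]


def swap_adjacent_chars(tokens: List[Token], label: str, index: int) -> List[Token]:
    """Simulate a typo by transposing two adjacent characters in the first token with `label`."""
    result: List[Token] = []
    done = False
    for text, lab in tokens:
        if lab == label and not done and index + 1 < len(text):
            text = text[:index] + text[index + 1] + text[index] + text[index + 2:]
            done = True
        result.append((text, lab))
    return result


def to_text(tokens: List[Token]) -> str:
    """Join tokens back into the raw string a parser would receive."""
    return " ".join(text for text, _ in tokens)


# Compose the perturbations: abbreviate, omit the ZIP code, then inject a typo.
noisy = swap_adjacent_chars(
    drop_component(abbreviate_suffix(CLEAN), "ZipCode"), "StreetName", 1
)
print(to_text(CLEAN))
print(to_text(noisy))
```

Because each perturbation carries the labels along with the tokens, the noisy string and its reference labels stay aligned, which is what a sequence-labeling parser such as a bidirectional LSTM-CRF needs for training and evaluation.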
