The remarkable success of GPT models across various tasks, including toponym recognition, motivates us to assess the performance of the GPT-3 model on the address parsing task in geocoding. To ensure that the evaluation more accurately reflects real-world performance, where user input varies widely in quality, and to address the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized from human input patterns mined from the actual input logs of a geocoding system in production. The dataset covers 21 types of input errors and variations, contains over 239,000 address records uniquely selected from streets across all 50 U.S. states and D.C., and is split into three subsets for training, validation, and testing. Building on this benchmark, we train and evaluate the GPT-3 model on extracting address components and compare its performance with that of transformer-based and LSTM-based models. The results indicate that the bidirectional LSTM-CRF model achieves the best performance, ahead of both the transformer-based models and GPT-3. The transformer-based models produce results closely comparable to those of the bidirectional LSTM-CRF model. GPT-3, though it trails both, shows potential for address parsing with only few-shot examples and leaves room for improvement through additional fine-tuning. We open-source the code and data of this benchmark so that researchers can use it for future model development or extend it to evaluate similar tasks, such as document geocoding.