The remarkable success of GPT models across various tasks, including toponym recognition, motivates us to assess the performance of the GPT-3 model on the geocoding address parsing task. To ensure that the evaluation more accurately mirrors real-world performance under diverse input quality, and to address the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized from human input patterns mined from the actual input logs of a geocoding system in production. The dataset covers 21 different input errors and variations; contains over 239,000 address records uniquely selected from streets across all 50 U.S. states and D.C.; and consists of three subsets to be used as training, validation, and testing sets. Building on this, we train and gauge the performance of the GPT-3 model in extracting address components, contrasting it with transformer-based and LSTM-based models. The evaluation results indicate that the Bidirectional LSTM-CRF model achieves the best performance, ahead of both the transformer-based models and the GPT-3 model. The transformer-based models produce results closely comparable to those of the Bidirectional LSTM-CRF model. The GPT-3 model, though trailing in performance, shows potential in the address parsing task with few-shot examples, leaving room for improvement with additional fine-tuning. We open-source the code and data of this benchmark so that researchers can use it for future model development or extend it to evaluate similar tasks, such as document geocoding.