Segmenting an address into meaningful components, also known as address parsing, is an essential step in many applications, from record linkage to geocoding and package delivery. Consequently, much work has been dedicated to developing accurate address parsing techniques, with machine learning and neural network methods leading the state of the art. However, most of the work on address parsing has been confined to academic endeavours, with few free and easy-to-use open-source solutions available. This paper presents Deepparse, an open-source, extendable, fine-tunable Python address parsing solution, released under the LGPL-3.0 licence, that parses multinational addresses using state-of-the-art deep learning algorithms and is evaluated on more than 60 countries. It can parse addresses written in any language and following any address standard. The pre-trained model achieves an average parsing accuracy of $99~\%$ on the countries used for training, with no pre-processing or post-processing needed. Moreover, the library supports fine-tuning with new data to generate a custom address parser.
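To make the task concrete, the following minimal sketch shows what "segmenting an address into meaningful components" produces: a list of (component, tag) pairs. It is a toy rule-based illustration only, not Deepparse's neural model or its API. The function name `toy_parse`, the tag names, the assumed comma-separated input format, and the sample address are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def toy_parse(address: str) -> List[Tuple[str, str]]:
    """Tag the parts of an address written as 'number street, city, province, postal code'.

    Illustrative assumption: the input always follows this fixed,
    comma-separated layout. Real addresses do not, which is why
    learned parsers are used in practice.
    """
    fields = [field.strip() for field in address.split(",")]
    if len(fields) != 4:
        raise ValueError("expected 'street, city, province, postal code'")
    street, city, province, postal_code = fields

    # Split the street field into a leading number and the remaining name.
    number, _, name = street.partition(" ")
    if not number.isdigit() or not name:
        raise ValueError("expected the street field to start with a number")

    # Hypothetical tag names, chosen for readability in this sketch.
    return [
        (number, "StreetNumber"),
        (name, "StreetName"),
        (city, "Municipality"),
        (province, "Province"),
        (postal_code, "PostalCode"),
    ]


if __name__ == "__main__":
    # Made-up address used only to show the output shape.
    for component, tag in toy_parse("10 Main Street, Springfield, Ontario, A1A 1A1"):
        print(f"{tag}: {component}")
```

Fixed rules like these fail as soon as the layout, language, or address standard changes, which is the gap a learned, multinational parser is meant to close.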