Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in text. A geoparser must resolve ambiguity, since the same surface form can refer to multiple locations. Several corpora have been proposed in previous work for evaluating geoparsing systems. However, these corpora are small and offer limited coverage of location expressions in general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method for constructing a large-scale geoparsing corpus from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions on average. Of these location expressions, 45.6% are ambiguous, that is, the same surface form refers to more than one location. In each article, coordinates are assigned to the location expression in the article title and to the location expressions in hyperlinks to other articles. Because each hyperlink identifies the intended article, coordinates can be assigned accurately even when a location expression in the text is ambiguous. Experimental results show that there remains room for improvement through disambiguation of location expressions.
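
The hyperlink-based annotation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `COORDS` table, the function names `annotate` and `ambiguous_surfaces`, and the simplified handling of `[[Target|surface]]` wiki links are all assumptions made for illustration, and the coordinate values are approximate. The sketch assigns coordinates to the article title and to each linked expression whose target article has known coordinates, then reports surface forms that point to more than one location.

```python
import re
from collections import defaultdict

# Hypothetical lookup from article title to (latitude, longitude).
# Values are approximate and for illustration only.
COORDS = {
    "Paris": (48.8567, 2.3522),
    "Paris, Texas": (33.6625, -95.5477),
    "France": (46.0, 2.0),
}

# Simplified wiki-link pattern: [[Target]] or [[Target|surface text]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(title, wikitext, coords):
    """Return (surface form, target article, coordinates) triples."""
    annotations = []
    # The article title itself is annotated if it has coordinates.
    if title in coords:
        annotations.append((title, title, coords[title]))
    # Each hyperlink supplies the intended target, which resolves ambiguity.
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        if target in coords:
            annotations.append((surface, target, coords[target]))
    return annotations


def ambiguous_surfaces(annotations):
    """Return surface forms that are linked to more than one target."""
    targets = defaultdict(set)
    for surface, target, _ in annotations:
        targets[surface].add(target)
    return {s for s, t in targets.items() if len(t) > 1}


text = (
    "The fair in [[Paris, Texas|Paris]] was modeled on one in "
    "[[Paris]], [[France]]. See also [[Eiffel Tower]]."
)
result = annotate("Paris, Texas", text, COORDS)
ambiguous = ambiguous_surfaces(result)
```

In this toy input, the surface form "Paris" is linked to two different articles, so it is flagged as ambiguous, while the link to an article without coordinates is skipped.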