Large language models (LLMs) exhibit emergent geospatial capabilities, which stem from pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, which limits our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent CC releases using Gemini, a powerful language model. By analyzing a sample of documents and manually reviewing the results, we estimate that between 1 in 6 and 1 in 5 documents contain geospatial information such as coordinates and street addresses. Our findings provide quantitative insights into the nature and extent of geospatial data within CC, and in web crawl data in general. Furthermore, we formulate questions to guide future investigations into the geospatial content of available web crawl datasets and its influence on LLMs.
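A prevalence estimate drawn from a document sample, like the range reported above, carries sampling uncertainty. The following is a minimal sketch of how such a range could be quantified with a Wilson score interval. It is an illustration only, not the procedure used in this paper: the function name `wilson_interval` and the counts in the usage lines are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    hits: number of sampled documents flagged as containing geospatial data
    n:    total number of sampled documents
    z:    normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must lie between 0 and n")

    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts for illustration only (NOT the paper's data).
low, high = wilson_interval(hits=180, n=1000)
print(f"estimated share of geospatial documents: {low:.3f} to {high:.3f}")
```

The Wilson interval is chosen here because it stays within [0, 1] and behaves reasonably for small samples or proportions near the boundaries, which the simpler normal approximation does not.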