With much of our lives taking place online, researchers are increasingly turning to information from the World Wide Web to gain insights into geographic patterns and processes. Web scraping as an online data acquisition technique allows us to gather intelligence, especially on social and economic actions for which the Web serves as a platform. Specific opportunities relate to near-real-time access to object-level geolocated data, which can be captured in a cost-effective way. The studied geographic phenomena include, but are not limited to, the rental market and associated processes such as gentrification, entrepreneurial ecosystems, or spatial planning processes. Since the information retrieved from the Web is not made available for that purpose, Web scraping faces a number of unique challenges, several of which relate to location. Ethical and legal issues mainly relate to intellectual property rights, informed consent and (geo-)privacy, and website integrity and contract. These issues also affect the practice of open science. In addition, there are technical and statistical challenges that relate to dependability and incompleteness, data inconsistencies and bias, as well as limited historical coverage. Geospatial analyses furthermore usually require the automated extraction and subsequent resolution of toponyms or addresses (geoparsing, geocoding). A study on apartment rent in Leipzig, Germany, is used to illustrate the use of Web scraping and its challenges. We conclude that geographic researchers should embrace Web scraping as a powerful and affordable digital fieldwork tool while paying special attention to its legal, ethical, and methodological challenges.