The ability to programmatically retrieve vast quantities of data from online sources has driven increasing use of web-scraped datasets across government, industry and academia. At the same time, there has been growing discussion of the statistical properties and limitations of collecting data from online sources and analysing web-scraped datasets. However, the literature on web-scraping is distributed across computer science, statistical methodology and application domains, with distinct and occasionally conflicting definitions of web-scraping and conceptualisations of web-scraped data quality. This work synthesises technical and statistical concepts, best practices and insights from these disciplines to inform documentation during web-scraping processes and quality assessment of the resulting web-scraped datasets. We propose an integrated framework covering the processes involved in creating web-scraped datasets: 'Plan', 'Retrieve', 'Investigate', 'Transform', 'Evaluate' and 'Summarise' (PRITES). The framework groups related quality factors that should be monitored during the collection of new web-scraped data and/or investigated when assessing potential applications of existing web-scraped datasets. We connect each stage to existing discussions of the technical and statistical challenges of collecting and analysing web-scraped data. We then apply the framework to describe related work by the co-authors adapting web-scraped retail prices for alcoholic beverages, collected by an industry data partner, into analysis-ready datasets for public health policy research. The case study illustrates how the framework supports accurate and comprehensive scientific reporting of studies using web-scraped datasets.