Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, for short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve the accessibility of open statistics data, we present a focused Web crawling algorithm that retrieves as many targets (i.e., resources of certain types) as possible from a given website, efficiently and scalably, while crawling much less than the full website. We show that solving this problem optimally is intractable, and propose an approach based on reinforcement learning, namely sleeping bandits. We introduce SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, retrieving a high fraction of a site's targets while crawling only a small part of the site.
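To make the sleeping-bandit idea concrete, the following is a minimal sketch under stated assumptions, not the SB-CLASSIFIER implementation. It assumes that links are grouped into arms by a label derived from the path to the link in its enclosing page, that an arm is "awake" only while it still has unvisited links, and that the reward of a fetch is the number of targets found on the fetched page. The UCB-style score is a standard bandit choice swapped in for illustration; the abstract does not specify the scoring rule. The names `SleepingBandit`, `crawl`, and `fetch`, and the arm labels, are hypothetical.

```python
import math
from collections import defaultdict


class SleepingBandit:
    """UCB-style bandit in which only some arms are available ("awake") per round."""

    def __init__(self, exploration=2.0):
        self.exploration = exploration   # weight of the exploration bonus (assumed value)
        self.pulls = defaultdict(int)    # arm -> number of times the arm was chosen
        self.mean = defaultdict(float)   # arm -> running mean of observed rewards
        self.rounds = 0                  # total number of updates so far

    def select(self, awake_arms):
        """Return the awake arm with the highest upper confidence bound."""
        awake = sorted(awake_arms)       # sorted for deterministic tie-breaking
        if not awake:
            raise ValueError("no awake arm to select")
        for arm in awake:                # try every never-pulled awake arm once
            if self.pulls[arm] == 0:
                return arm
        log_t = math.log(self.rounds + 1)

        def ucb(arm):
            bonus = math.sqrt(self.exploration * log_t / self.pulls[arm])
            return self.mean[arm] + bonus

        return max(awake, key=ucb)

    def update(self, arm, reward):
        """Record the reward observed after pulling `arm`."""
        self.rounds += 1
        self.pulls[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.pulls[arm]


def crawl(frontier, fetch, bandit, budget):
    """Crawl at most `budget` pages and return the number of targets found.

    frontier: dict mapping an arm label to a list of unvisited URLs.
    fetch:    hypothetical callable, url -> (n_targets, {arm_label: [new_urls]}).
    """
    found = 0
    visited = set()
    for _ in range(budget):
        # An arm is awake only if it still has links left to follow.
        awake = [arm for arm, urls in frontier.items() if urls]
        if not awake:
            break
        arm = bandit.select(awake)
        url = frontier[arm].pop()
        visited.add(url)
        n_targets, new_links = fetch(url)
        bandit.update(arm, n_targets)    # reward = targets found on this page (assumption)
        found += n_targets
        for new_arm, urls in new_links.items():
            fresh = [u for u in urls if u not in visited]
            frontier.setdefault(new_arm, []).extend(fresh)
    return found
```

In this sketch, a group of links whose pages consistently yield targets keeps a high mean reward and is followed first, while a group that has run out of links simply drops out of the candidate set until new links with the same label are discovered.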