This paper introduces Fundus, a user-friendly news scraper that lets users obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, Fundus uses manually crafted, bespoke content extractors, each tailored to the formatting guidelines of one supported online newspaper. This lets us optimize scraping for quality, so that retrieved articles are textually complete and free of HTML artifacts. Further, the framework combines crawling (retrieving HTML from the web or from large web archives) and content extraction in a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable, even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher-quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub at https://github.com/flairNLP/fundus and can be installed with pip.
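To make the design concrete, the following is a minimal sketch of the general idea of pairing one hand-written extraction rule per publisher with a single crawl-and-extract pipeline. It is not the Fundus API: the names `Article`, `make_extractor`, `EXTRACTORS`, and `crawl`, the publisher identifiers, the CSS class names, and the `.invalid` URLs are all hypothetical, and the pages are served from an in-memory dictionary rather than fetched from the web.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class Article:
    publisher: str
    title: str
    body: str


class _RuleParser(HTMLParser):
    """Collects text from <h1> and from <p> elements that carry a given class."""

    def __init__(self, paragraph_class: str) -> None:
        super().__init__()
        self.paragraph_class = paragraph_class
        self.title_parts: List[str] = []
        self.paragraphs: List[str] = []
        self._target: Optional[List[str]] = None  # list currently receiving text

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._target = self.title_parts
        elif tag == "p":
            classes = (dict(attrs).get("class") or "").split()
            if self.paragraph_class in classes:
                self.paragraphs.append("")
                self._target = self.paragraphs
            else:
                self._target = None  # e.g. ads or teasers: ignore their text

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._target = None

    def handle_data(self, data):
        if self._target is self.title_parts:
            self.title_parts.append(data)
        elif self._target is self.paragraphs:
            self.paragraphs[-1] += data


def make_extractor(publisher: str, paragraph_class: str) -> Callable[[str], Article]:
    """Build a publisher-specific extractor (hypothetical rule: one CSS class)."""

    def extract(html: str) -> Article:
        parser = _RuleParser(paragraph_class)
        parser.feed(html)
        parser.close()
        title = " ".join("".join(parser.title_parts).split())
        body = "\n\n".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())
        return Article(publisher, title, body)

    return extract


# One bespoke rule per supported publisher (identifiers and classes are made up).
EXTRACTORS: Dict[str, Callable[[str], Article]] = {
    "example-times": make_extractor("example-times", "story-text"),
    "sample-post": make_extractor("sample-post", "article__body"),
}


def crawl(sources: Dict[str, List[str]], fetch: Callable[[str], str]) -> Iterator[Article]:
    """Single pipeline: fetch each URL, then apply that publisher's extractor."""
    for publisher, urls in sources.items():
        extract = EXTRACTORS[publisher]
        for url in urls:
            yield extract(fetch(url))


# In-memory stand-in for the web, so the sketch runs without network access.
PAGES = {
    "https://example-times.invalid/a1": (
        "<html><body><h1>Storm hits coast</h1>"
        '<p class="ad">Subscribe now</p>'
        '<p class="story-text">Heavy rain <b>flooded</b> roads.</p>'
        '<p class="story-text">Cleanup begins.</p></body></html>'
    ),
    "https://sample-post.invalid/b7": (
        "<html><body><h1>Council votes on budget</h1>"
        '<div class="related">Read more</div>'
        '<p class="article__body">The vote passed.</p></body></html>'
    ),
}

SOURCES = {
    "example-times": ["https://example-times.invalid/a1"],
    "sample-post": ["https://sample-post.invalid/b7"],
}

articles = list(crawl(SOURCES, PAGES.__getitem__))
for article in articles:
    print(f"[{article.publisher}] {article.title}")
    print(article.body)
```

Because `fetch` is passed in as a parameter, the same loop could in principle read from a live site or from a web archive; the extraction rules stay tied to the publisher rather than to the source of the HTML.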