News data have become an essential resource across many disciplines, yet access to full-text news corpora remains difficult because of high costs and the limited availability of free alternatives. This paper presents gdeltnews, a novel Python package that reconstructs full-text newspaper articles at near-zero cost from the Global Database of Events, Language, and Tone (GDELT) Web News NGrams 3.0 dataset. The method merges overlapping n-grams extracted from global online news to rebuild complete articles. We validate the approach on a benchmark set of 2211 articles from major U.S. news outlets, achieving up to 95% text similarity to the original articles as measured by Levenshtein and SequenceMatcher metrics. By enabling free, large-scale access to full-text news data, the tool supports applications in economic forecasting, computational social science, information science, and natural language processing.
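The idea of merging overlapping n-grams can be illustrated with a minimal sketch. It is not the gdeltnews implementation: the function names (`overlap_len`, `merge_fragments`, `similarity`), the greedy largest-overlap strategy, and the toy sentence are all assumptions made for illustration. The real dataset's record layout is not modelled here. Only the SequenceMatcher metric is shown, since it ships with the Python standard library; a Levenshtein-based score would need a separate implementation or a third-party package.

```python
from difflib import SequenceMatcher


def overlap_len(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_fragments(fragments):
    """Greedily merge token fragments, always joining the pair with the largest overlap.

    Hypothetical helper: returns a list of merged token lists. More than one
    list remains if some fragments share no overlap with the rest.
    """
    pool = [list(f) for f in fragments]
    while len(pool) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap_len(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [f for idx, f in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool


def similarity(original, reconstructed):
    """Character-level similarity ratio in [0, 1] from difflib.SequenceMatcher."""
    return SequenceMatcher(None, original, reconstructed).ratio()


# Toy example (invented sentence): overlapping 5-word windows in shuffled order.
original = "the central bank raised interest rates by a quarter point on wednesday"
tokens = original.split()
windows = [tokens[s:s + 5] for s in (4, 0, 7, 2, 6)]

pieces = merge_fragments(windows)
reconstructed = " ".join(pieces[0])
print(reconstructed)
print(similarity(original, reconstructed))
```

The greedy strategy works on this toy input because every word is unique, so any suffix-prefix match is a true overlap. Real news text repeats words and phrases, which makes overlaps ambiguous, so a practical reconstruction would need additional signals to order and disambiguate fragments.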