This study develops and evaluates a systematic methodology for constructing news datasets from Google News, combining automated web scraping, large language model (LLM)-based metadata extraction, and enrichment with SCImago Media Rankings. Using the IFMIF-DONES fusion energy project as a case study, we implemented a five-stage data collection pipeline across 81 region-language combinations, yielding 1,482 validated records after a 56% noise reduction. Results are compared against two licensed press databases: MyNews (2,280 records) and ProQuest Newsstream Collection (148 records). Overlap analysis reveals high complementarity, with 76% of Google News records exclusive to that platform. The dataset captures content types absent from proprietary databases, including specialized outlets, institutional communications, and social media posts. However, significant methodological challenges emerge: temporal instability requiring synchronic collection, a 100-result cap per query demanding multi-stage query strategies, and unexpected noise, including academic PDFs, false positives, and pornographic content infiltrating results through black-hat SEO techniques. LLM-assisted extraction proved effective for structured articles but exhibited systematic hallucination patterns requiring validation protocols. We conclude that Google News offers valuable complementary coverage for communication research but demands substantial methodological investment, multi-source triangulation, and robust filtering mechanisms to ensure dataset integrity.