Large Language Models (LLMs) increasingly rely on web crawling to stay up to date and answer user queries accurately. These crawlers are expected to honor robots.txt files, which govern automated access. In this study, we investigate for the first time whether reputable news websites and misinformation sites differ in how they configure these files, particularly with respect to AI crawlers. Analyzing a curated dataset, we find a stark contrast: 60.0% of reputable sites disallow at least one AI crawler in their robots.txt files, compared to just 9.1% of misinformation sites. Reputable sites forbid an average of 15.5 AI user agents, while misinformation sites prohibit fewer than one. We then measure active blocking behavior, in which websites refuse to return content when HTTP requests carry AI crawler user agents, and find that both categories of websites employ it. Notably, the behavior of reputable news websites in this regard aligns more closely with their declared robots.txt directives than that of misinformation websites. Finally, our longitudinal analysis shows that this gap has widened over time, with AI-crawler blocking by reputable sites rising from 23% in September 2023 to nearly 60% by May 2025. Our findings highlight a growing asymmetry in content accessibility that may shape the training data available to LLMs, raising important questions about web transparency, data ethics, and the future of AI training practices.
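To make the two measurements concrete, the following is a minimal sketch of how the robots.txt check and the active-blocking check could be performed. It is not the authors' tooling: the AI user-agent list, the example domain, and the status-code heuristic are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch (illustrative, not the paper's measurement code):
# (1) parse a site's robots.txt for rules that disallow known AI crawler
#     user agents, and (2) probe whether the site actively refuses content
#     when the HTTP request identifies itself as an AI crawler.
import urllib.robotparser
import urllib.request
import urllib.error

# Assumed list of AI crawler user agents; the paper's full list is longer.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot"]

def robots_disallows(site: str, agents=AI_USER_AGENTS) -> dict:
    """Return, per AI user agent, whether robots.txt forbids fetching the homepage."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"https://{site}/robots.txt")
    rp.read()
    return {ua: not rp.can_fetch(ua, f"https://{site}/") for ua in agents}

def actively_blocks(site: str, agent: str = "GPTBot") -> bool:
    """Heuristic: does the server refuse content when the request claims to be an AI crawler?"""
    req = urllib.request.Request(f"https://{site}/", headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status >= 400
    except urllib.error.HTTPError as err:
        # Common "go away" responses; treated here as evidence of active blocking.
        return err.code in (401, 403, 429)
    except urllib.error.URLError:
        return False  # network failure, not evidence of blocking

if __name__ == "__main__":
    site = "example.com"  # hypothetical domain for illustration
    print(robots_disallows(site))
    print(actively_blocks(site))
```

Comparing the two outputs for a given site mirrors the paper's consistency question: a site whose robots.txt disallows AI crawlers but which still serves content to an AI user agent has a declared policy that its observed behavior does not enforce.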