Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web, and the methodology underlying them is often opaque. Moreover, we find that detectors of LLM-generated text perform much worse than advertised when the goal is to minimize the chance of falsely attributing human-authored content to LLMs. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb, which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that their share is growing over time. We also show that continuing to identify such sites accurately appears challenging given the capabilities of the latest LLMs.
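The abstract does not say how page-level detection results are combined into a site-level label, so the following is only a minimal sketch of one plausible aggregation rule, not DeGenTWeb's actual method. It assumes each page has already been scored by some detector with a value in [0, 1]. The names `classify_site` and `SiteVerdict`, and the three threshold parameters, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SiteVerdict:
    pages_scored: int    # number of pages with a detector score
    pages_flagged: int   # pages whose score met the page-level threshold
    llm_dominant: bool   # site-level label under the assumed rule


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.9,        # assumed: a strict cut-off to limit false positives
    min_pages: int = 5,                 # assumed: abstain on sites with too little evidence
    min_flagged_fraction: float = 0.5,  # assumed: majority of pages must be flagged
) -> SiteVerdict:
    """Aggregate per-page detector scores into one site-level verdict.

    This is an illustrative rule, not the rule used in the paper.
    """
    total = len(page_scores)
    flagged = sum(1 for score in page_scores if score >= page_threshold)
    # Label the site only when enough pages were scored and enough were flagged.
    dominant = (
        total > 0
        and total >= min_pages
        and flagged / total >= min_flagged_fraction
    )
    return SiteVerdict(pages_scored=total, pages_flagged=flagged, llm_dominant=dominant)


if __name__ == "__main__":
    # Made-up scores for two hypothetical sites.
    print(classify_site([0.95, 0.99, 0.20, 0.97, 0.91]))
    print(classify_site([0.10, 0.20, 0.95, 0.30, 0.40, 0.50]))
```

A strict page-level threshold combined with a minimum page count reflects the abstract's stated concern with not attributing human-authored content to LLMs: a single high-scoring page is not enough to label a whole site.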