DeGenTWeb: A First Look at LLM-dominant Websites

Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web, and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb, which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that their share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
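The abstract does not say how page-level detections are combined into a site-level label, so the following is only a minimal sketch of one plausible aggregation rule, not the authors' method. It assumes each page already has a detector score in [0, 1]. The function name `classify_site` and the parameters `page_threshold`, `site_fraction`, and `min_pages` are hypothetical and chosen purely for illustration.

```python
from typing import Optional, Sequence


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.9,  # assumed: strict cutoff to limit false positives on human text
    site_fraction: float = 0.5,   # assumed: share of flagged pages needed to call a site LLM-dominant
    min_pages: int = 5,           # assumed: abstain when too few pages were sampled
) -> Optional[bool]:
    """Aggregate per-page detector scores into a site-level decision.

    Returns True if the site looks LLM-dominant, False if it does not,
    and None if there are too few pages to decide.
    """
    if len(page_scores) < min_pages:
        return None
    # Count pages whose score clears the page-level cutoff.
    flagged = sum(1 for score in page_scores if score >= page_threshold)
    # Label the site by the fraction of its sampled pages that were flagged.
    return flagged / len(page_scores) >= site_fraction
```

Abstaining on small samples and using a strict page-level cutoff both reflect the abstract's stated priority of not attributing human-authored content to LLMs. The specific values would need to be calibrated on labeled data.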

