Web measurements are a well-established methodology for assessing the security and privacy landscape of the Internet. However, existing top lists of popular websites are unlabeled and lack semantic information about the nature of the included websites, making targeted web measurements challenging, as researchers often rely on ad-hoc techniques to bias datasets toward specific website classes of interest. In this paper, we investigate the use of Large Language Models (LLMs) to enable targeted web measurement studies. Building on prior literature, we identify key website classification tasks relevant to web measurements and highlight limitations in state-of-the-art classification approaches. We construct carefully curated datasets to evaluate different LLMs on these tasks. Our results show that LLMs can achieve strong performance across multiple classification scenarios, but the choice of model and configuration plays a significant role. Motivated by the observed trade-off between classification accuracy and computational efficiency, we propose a practical two-step methodology for scalable targeted web measurements starting from the Tranco list. Finally, we conduct LLM-assisted web measurement studies inspired by prior work using our methodology and assess the validity of the resulting research inferences, showing that LLMs can effectively enable targeted measurements of security and privacy trends on the Web.