The enforcement of the GDPR led to the widespread adoption of consent notices, colloquially known as cookie banners. Studies have shown that many website operators do not comply with the law: they track users before any interaction with the consent notice, or attempt to trick users into giving consent through dark patterns. Previous research has relied either on manually curated filter lists or on automated detection methods that cover only a subset of websites, which makes research on the GDPR compliance of consent notices tedious or limited in scope. We present \emph{cookiescanner}, an automated scanning tool that detects and extracts consent notices via several methods and checks whether they offer a decline option or use color diversion. We evaluated cookiescanner on a random sample of the top 10,000 websites listed by Tranco. We found that manually curated filter lists have the highest precision but recall fewer consent notices than our keyword-based methods. Our BERT model achieves high precision for English notices, in line with previous work, but suffers from low recall due to insufficient candidate extraction. While the automated detection of decline options proved challenging due to the dynamic nature of many sites, detecting differently colored buttons succeeded in most cases. Besides systematically evaluating our detection techniques, we manually annotated 1,000 websites to provide a ground-truth baseline, which did not previously exist. Furthermore, we release our code and the annotated dataset in the interest of reproducibility and repeatability.
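To make the precision and recall comparison concrete, the following minimal sketch shows how a detection method could be scored against a manually annotated ground truth. It is illustrative only and is not taken from the cookiescanner code base: the function name `precision_recall`, the domain names, and the sample detection sets are all hypothetical.

```python
def precision_recall(detected: set[str], ground_truth: set[str]) -> tuple[float, float]:
    """Score one detection method against manually annotated sites.

    detected:     sites on which the method reported a consent notice
    ground_truth: sites on which an annotator confirmed a consent notice
    """
    true_positives = len(detected & ground_truth)
    # Precision: share of reported notices that are real.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of real notices that were reported.
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical data, invented for illustration (not results from the paper).
annotated = {"a.example", "b.example", "c.example", "d.example"}
filter_list_hits = {"a.example", "b.example"}
keyword_hits = {"a.example", "b.example", "c.example", "e.example"}

print(precision_recall(filter_list_hits, annotated))
print(precision_recall(keyword_hits, annotated))
```

In this invented example the filter list makes no false detections but misses half of the annotated notices, whereas the keyword method finds more notices at the cost of one false positive, which mirrors the kind of trade-off described above.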