Cookiescanner: An Automated Tool for Detecting and Evaluating GDPR Consent Notices on Websites

The enforcement of the GDPR led to the widespread adoption of consent notices, colloquially known as cookie banners. Studies have shown that many website operators do not comply with the law and track users prior to any interaction with the consent notice, or attempt to trick users into giving consent through dark patterns. Previous research has relied on manually curated filter lists or automated detection methods limited to a subset of websites, making research on GDPR compliance of consent notices tedious or limited. We present cookiescanner, an automated scanning tool that detects and extracts consent notices via various methods and checks if they offer a decline option or use color diversion. We evaluated cookiescanner on a random sample of the top 10,000 websites listed by Tranco. We found that manually curated filter lists have the highest precision but recall fewer consent notices than our keyword-based methods. Our BERT model achieves high precision for English notices, which is in line with previous work, but suffers from low recall due to insufficient candidate extraction. While the automated detection of decline options proved to be challenging due to the dynamic nature of many sites, detecting instances of different colors of the buttons was successful in most cases. Besides systematically evaluating our various detection techniques, we have manually annotated 1,000 websites to provide a ground-truth baseline, which has not existed previously. Furthermore, we release our code and the annotated dataset in the interest of reproducibility and repeatability.
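The abstract does not say how color diversion is detected, so the following is only a minimal sketch of one plausible check, not cookiescanner's actual implementation. It assumes the computed background colors of the accept and decline buttons are already available as CSS `rgb(...)` strings. The function names, the Euclidean RGB distance, and the threshold of 50 are all hypothetical choices made for illustration.

```python
import re


def parse_rgb(css_color: str) -> tuple[int, int, int]:
    """Parse a computed CSS color such as 'rgb(0, 128, 255)' or 'rgba(0, 128, 255, 0.5)'."""
    match = re.match(r"rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", css_color.strip())
    if match is None:
        raise ValueError(f"unsupported color format: {css_color!r}")
    r, g, b = (int(value) for value in match.groups())
    return r, g, b


def color_distance(a: tuple[int, int, int], b: tuple[int, int, int]) -> float:
    """Euclidean distance between two RGB triples (an assumed, simplistic metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def uses_color_diversion(accept_css: str, decline_css: str, threshold: float = 50.0) -> bool:
    """Flag a notice when the two buttons differ in color by more than the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return color_distance(parse_rgb(accept_css), parse_rgb(decline_css)) > threshold
```

For example, `uses_color_diversion("rgb(0, 128, 0)", "rgb(255, 255, 255)")` flags a green accept button next to a white decline button, while two identically colored buttons are not flagged. A perceptual color space would likely match human judgment better than plain RGB distance; RGB is used here only to keep the sketch short.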
