The World Wide Web's connectivity owes much to the HTTP protocol, and the header fields of HTTP messages carry information of interest to fields such as web security and privacy, particularly for web tracking. Existing research has used HTTP request messages to identify web trackers, but HTTP response headers are often overlooked. This study designs machine learning classifiers for web tracker detection that use binarized HTTP response headers as features. The dataset consists of traffic from the Chrome, Firefox, and Brave browsers, collected with the traffic-monitoring browser extension T.EX. Ten supervised models were trained on Chrome data and tested on all three browsers, including a Chrome dataset collected one year later. The models achieved high accuracy, F1-score, precision, and recall, together with low log-loss, on Chrome and Firefox, but performed poorly on Brave, potentially because of its distinct data distribution and feature set. The results suggest that these classifiers are viable for web tracker detection. However, they have not yet been tested in a real-world application, and distinguishing between tracker types and drawing on broader label sources are left to future work.
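As a rough illustration of the pipeline summarized above, the following sketch shows one way response headers could be binarized and how the reported metrics are computed. It assumes that "binarized" means a 0/1 flag for whether a header name is present in a response, which the abstract does not spell out. The function names `binarize_headers` and `evaluate`, the sample headers, and the probabilities are all hypothetical and are not taken from the study.

```python
import math


def binarize_headers(responses, vocabulary=None):
    """Map each response's header dict to a 0/1 presence vector.

    Assumption: a feature is 1 when the header name occurs in the response.
    If no vocabulary is given, it is built from the responses themselves.
    """
    if vocabulary is None:
        vocabulary = sorted({name.lower() for headers in responses for name in headers})
    vectors = []
    for headers in responses:
        present = {name.lower() for name in headers}
        vectors.append([1 if name in present else 0 for name in vocabulary])
    return vocabulary, vectors


def evaluate(y_true, y_prob, threshold=0.5, eps=1e-15):
    """Compute accuracy, precision, recall, F1 and log-loss for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Clip probabilities so that log(0) never occurs.
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
    ) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }


# Hypothetical responses standing in for the training browser's traffic.
train_responses = [
    {"Content-Type": "image/gif", "Set-Cookie": "id=1", "P3P": "CP=NOI"},
    {"Content-Type": "text/html", "Cache-Control": "max-age=60"},
]
vocab, train_vectors = binarize_headers(train_responses)

# Responses from another browser reuse the training vocabulary, so header
# names never seen during training are silently dropped.
_, other_vectors = binarize_headers(
    [{"content-type": "text/html", "X-New-Header": "1"}], vocabulary=vocab
)

# Hypothetical labels (1 = tracker) and predicted probabilities.
metrics = evaluate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1])
```

Reusing the training vocabulary on another browser's traffic means any header outside that vocabulary contributes nothing to the feature vector, which is one plausible way a differing feature set could hurt cross-browser performance.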