HTTP underpins much of the World Wide Web's connectivity, and the header fields of HTTP messages carry information that is valuable to web security and privacy research, particularly for the study of web tracking. Existing work has used HTTP request messages to identify web trackers, but HTTP response headers are often overlooked. This study designs machine learning classifiers that detect web trackers from binarized HTTP response headers. The dataset consists of traffic from the Chrome, Firefox, and Brave browsers, collected with the traffic monitoring browser extension T.EX. Ten supervised models were trained on Chrome data and tested on all three browsers, including a Chrome dataset collected a year later. The models achieved high accuracy, F1-score, precision, and recall, along with low log-loss, on Chrome and Firefox, but performed poorly on Brave, potentially because of its distinct data distribution and feature set. These results suggest that the classifiers are viable for web tracker detection. However, they have not yet been tested in a real-world application, and future work could distinguish between tracker types and draw on broader label sources.
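As an illustration of the feature representation named in the abstract, the following is a minimal sketch of how HTTP response headers might be binarized. It assumes that "binarized" means one 0/1 feature per header name, indicating whether that header is present in a response; the abstract does not specify the exact encoding. The function names `build_vocabulary` and `binarize` and the toy responses are hypothetical and are not taken from the study or from T.EX.

```python
from typing import Dict, Iterable, List


def build_vocabulary(responses: Iterable[Dict[str, str]]) -> List[str]:
    """Collect the sorted set of header names seen in the training responses.

    Names are lowercased because HTTP header field names are case-insensitive.
    """
    names = set()
    for headers in responses:
        names.update(name.lower() for name in headers)
    return sorted(names)


def binarize(headers: Dict[str, str], vocabulary: List[str]) -> List[int]:
    """Return 1 for each vocabulary header present in the response, else 0.

    Headers that are not in the vocabulary are ignored (assumption: the
    feature set is fixed by the training data).
    """
    present = {name.lower() for name in headers}
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical toy responses, not real measurement data.
train_responses = [
    {"Content-Type": "image/gif", "Set-Cookie": "id=abc", "P3P": "CP=NOI"},
    {"Content-Type": "text/html", "Cache-Control": "max-age=3600"},
]

vocabulary = build_vocabulary(train_responses)
features = [binarize(h, vocabulary) for h in train_responses]

# A response observed later (for example, from another browser) is encoded
# against the same vocabulary; its unseen "ETag" header is dropped.
unseen = binarize({"ETag": "x", "content-type": "text/css"}, vocabulary)
```

Under this assumption, each response becomes a fixed-length 0/1 vector that any supervised classifier can consume. Headers that appear only in another browser's traffic have no column in a vocabulary built from the training data, which is one way a differing feature set could affect results across browsers.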