Website fingerprinting (WF) is a dangerous attack on web privacy because it enables an adversary to predict the website a user is visiting, despite the use of encryption, VPNs, or anonymizing networks such as Tor. Previous WF work almost exclusively uses synthetic datasets to evaluate the performance and estimate the feasibility of WF attacks despite evidence that synthetic data misrepresents the real world. In this paper we present GTT23, the first WF dataset of genuine Tor traces, which we obtain through a large-scale measurement of the Tor network and which is intended especially for WF. It represents real Tor user behavior better than any existing WF dataset, is larger than any existing WF dataset by at least an order of magnitude, and will help ground the future study of realistic WF attacks and defenses. In a detailed evaluation, we survey 28 WF datasets published since 2008 and compare their characteristics to those of GTT23. We discover common deficiencies of synthetic datasets that make them inferior to GTT23 for drawing meaningful conclusions about the effectiveness of WF attacks directed at real Tor users. We have made GTT23 available to promote reproducible research and to help inspire new directions for future work.