Domain probe lists, used to determine which URLs to probe for Web censorship, play a critical role in Internet censorship measurement studies. Indeed, the size and accuracy of the domain probe list limits the set of censored pages that can be detected; inaccurate lists can lead to an incomplete view of the censorship landscape or biased results. Previous efforts to generate domain probe lists have been mostly manual or crowdsourced. This approach is time-consuming, prone to errors, and does not scale well to the ever-changing censorship landscape. In this paper, we explore methods for automatically generating probe lists that are both comprehensive and up-to-date for Web censorship measurement. We start from an initial set of 139,957 unique URLs, drawn from various existing test lists spanning many languages, to generate new candidate pages. By analyzing content from these URLs (i.e., performing topic and keyword extraction), expanding these topics, and using them as a feed to search engines, our method produces 119,255 new URLs across 35,147 domains. We then test the new candidate pages by attempting to access each URL from servers in eleven different global locations over a span of four months, checking for connectivity and potential signs of censorship. Our measurements reveal that our method discovered over 1,400 domains, not present in the original dataset, that we suspect are blocked. In short, automatically updating probe lists is possible, and can help further automate censorship measurements at scale.
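The keyword-extraction and query-expansion steps of the pipeline can be sketched as follows. This is a minimal illustration only: it uses simple frequency-based keyword ranking and hypothetical query templates as stand-ins for the paper's actual topic extraction and search-engine expansion, and the function names are assumptions, not the authors' implementation.

```python
import re
from collections import Counter

# Tiny stopword list for illustration; a real pipeline would use a
# full multilingual stopword set.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def extract_keywords(text, k=5):
    """Rank content words by frequency (a crude stand-in for the
    paper's topic/keyword extraction from fetched page content)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(k)]

def expand_to_queries(keywords, templates=("{} news", "{} blocked", "{} forum")):
    """Expand keywords into search-engine queries. The templates here
    are hypothetical; the paper expands extracted topics before
    feeding them to search engines to discover candidate URLs."""
    return [tmpl.format(kw) for kw in keywords for tmpl in templates]

# Example: extract keywords from page text, then build queries whose
# search results become new candidate probe URLs.
page_text = "internet censorship measurement probes internet internet censorship"
keywords = extract_keywords(page_text, k=2)
queries = expand_to_queries(keywords)
```

Candidate URLs returned by the search engine for each query would then be probed from multiple global vantage points, as described above.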