The dark web has become notorious for its association with illicit activities, and there is a growing need for systems that automate the monitoring of this space. This paper proposes an end-to-end scalable architecture for the early identification of new Tor sites and the daily analysis of their content. The solution is built on an open-source big data stack for data serving (Kubernetes, Kafka, Kubeflow, and MinIO). It continuously discovers onion addresses from several sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), downloads the HTML from Tor, deduplicates the content using MinHash LSH, and categorizes it with BERTopic topic modeling (SBERT embeddings, UMAP dimensionality reduction, HDBSCAN document clustering, and c-TF-IDF topic keywords). In 93 days, the system identified 80,049 onion services and characterized 90% of them, addressing the challenge of Tor volatility. A disproportionate amount of repeated content was found, with only 6.1% of sites being unique. From the HTML files of the dark sites, 31 low-level topics were extracted, manually labeled, and grouped into 11 high-level topics. The five most popular included sexual and violent content, repositories, search engines, carding, cryptocurrencies, and marketplaces. During the experiments, we identified 14 sites with 13,946 clones that shared a suspiciously similar mirroring rate per day, suggesting an extensive common phishing network. Among related works, this study offers the most representative topic-based characterization of onion services to date.
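To make the deduplication step concrete, the following is a minimal sketch of MinHash with LSH banding over page text, using only the Python standard library. It is not the paper's implementation: the function names, the word-shingle size, the number of hash functions, the band count, and the similarity threshold are all illustrative assumptions, since the abstract does not state them.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed value)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: for each salted hash, keep the minimum over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # one salt per simulated permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(pages, threshold=0.8, num_perm=64, bands=16):
    """Return pairs of page ids whose estimated similarity is >= threshold.

    LSH banding: each signature is cut into `bands` slices, and two pages
    become candidates only if at least one slice is identical.
    """
    rows = num_perm // bands
    signatures = {pid: minhash_signature(shingles(text), num_perm)
                  for pid, text in pages.items()}

    buckets = defaultdict(list)
    for pid, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(pid)

    candidates = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                candidates.add(tuple(sorted((ids[i], ids[j]))))

    # Verify candidates against the full signature to drop false positives.
    return {pair for pair in candidates
            if estimated_jaccard(signatures[pair[0]], signatures[pair[1]]) >= threshold}


if __name__ == "__main__":
    # Hypothetical page texts keyed by made-up identifiers.
    pages = {
        "a.onion": "welcome to the hidden market buy cards and coins here today",
        "b.onion": "welcome to the hidden market buy cards and coins here today",
        "c.onion": "a search engine indexing onion services across the tor network",
    }
    print(near_duplicates(pages))
```

Banding keeps the comparison cost well below all-pairs matching, which matters when most crawled pages are clones of a small number of originals. In practice a library such as datasketch would likely replace the hand-rolled hashing shown here.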