Recent research suggests that the language used on the Dark Web differs clearly from that of the Surface Web. Because studies of the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT, which are intended to combat the extreme lexical and structural diversity of the Dark Web that may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart along with other widely used language models to validate the benefits that a Dark Web domain-specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.