Recent research has suggested that the language used on the Dark Web differs clearly from that of the Surface Web. As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT, which are intended to counter the extreme lexical and structural diversity of the Dark Web that may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart along with other widely used language models to validate the benefits that a Dark Web domain-specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.