Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging. For text domains that demand this level of expertise, pretraining on in-domain corpora has been a popular way for language models to acquire domain knowledge. However, cybersecurity texts often contain non-linguistic elements (NLEs), such as URLs and hash values, that may be ill-suited to established pretraining methodologies. Previous work in other domains has removed or filtered such text as noise, but the effectiveness of these methods has not been investigated, especially in the cybersecurity domain. We propose different pretraining methodologies and evaluate their effectiveness through downstream tasks and probing tasks. Our proposed strategy, which combines selective masked language modeling (MLM) with joint training on NLE token classification, outperforms the common approach of replacing NLEs. We use our domain-customized methodology to train CyBERTuned, a cybersecurity domain language model that outperforms other cybersecurity pretrained language models (PLMs) on most tasks.
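To make the idea of selective MLM more concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization, two illustrative NLE types detected by hypothetical regular expressions (`URL_RE`, `HASH_RE`), and an assumed policy (`skip_labels`) in which hash tokens are never chosen as masking targets. The abstract does not state which NLE types are masked, so that choice is purely illustrative. The per-token labels produced by `tag_tokens` could also serve as targets for a jointly trained NLE token classification head.

```python
import random
import re

# Hypothetical patterns for two NLE types; a real pipeline would cover many more.
URL_RE = re.compile(r"https?://\S+")
HASH_RE = re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")


def tag_tokens(text):
    """Split on whitespace and label each token 'URL', 'HASH', or 'O' (ordinary text)."""
    tagged = []
    for token in text.split():
        if URL_RE.fullmatch(token):
            tagged.append((token, "URL"))
        elif HASH_RE.fullmatch(token):
            tagged.append((token, "HASH"))
        else:
            tagged.append((token, "O"))
    return tagged


def selective_mask(tagged, skip_labels=frozenset({"HASH"}), mask_prob=0.15, rng=None):
    """Choose MLM targets, never selecting tokens whose label is in skip_labels.

    Returns the masked token sequence and a parallel list of targets, where
    None marks positions that contribute no MLM loss.
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for token, label in tagged:
        if label not in skip_labels and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(token)
        else:
            masked.append(token)
            targets.append(None)
    return masked, targets
```

Under these assumptions, hash tokens stay in the input as context but are never prediction targets, in contrast to the replacement approach, which would substitute them with a placeholder before pretraining.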