As cybersecurity vulnerabilities in software systems increase, so do the ways to exploit them. Malware threats, irregular network interactions, and discussions of exploits in public forums are also on the rise. Automated approaches are necessary to identify these threats faster, to detect potentially relevant entities in text, and to stay aware of software vulnerabilities. Applying natural language processing (NLP) techniques in the cybersecurity domain can help achieve this. However, there are challenges, such as the diverse nature of the texts involved, the lack of large-scale publicly available datasets, and the significant cost of hiring subject matter experts for annotation. One solution is to build multi-task models that can be trained jointly with limited data. In this work, we introduce a generative multi-task model, Unified Text-to-Text Cybersecurity (UTS), trained on malware reports, phishing site URLs, programming code constructs, social media data, blogs, news articles, and public forum posts. We show that UTS improves performance on some cybersecurity datasets. We also show that, with a few examples, UTS can be adapted to novel, unseen tasks and to the nature of their data.