The use of ML in cybersecurity has long been impaired by generalization issues: models that work well in controlled scenarios fail to maintain performance in production. The root cause often lies in ML algorithms learning superficial patterns (shortcuts) rather than the underlying cybersecurity concepts. We investigate contrastive multi-modal learning as a first step towards improving ML performance in cybersecurity tasks. We aim to transfer knowledge from data-rich modalities, such as text, to data-scarce modalities, such as payloads. As a case study on threat classification, we propose a two-stage multi-modal contrastive learning framework that uses textual vulnerability descriptions to guide payload classification. First, we construct a semantically meaningful embedding space using contrastive learning on descriptions. Then, we align payloads to this space, transferring knowledge from text to payloads. We evaluate the approach on a large-scale private dataset and on a synthetic benchmark built from public CVE descriptions and LLM-generated payloads. The methodology appears to reduce shortcut learning relative to baselines on both benchmarks. We release our synthetic benchmark and source code as open source.
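The second stage, aligning payloads to the text embedding space, can be illustrated with a minimal sketch. The abstract does not state which contrastive objective is used, so the InfoNCE-style loss below is an assumption, and the names `info_nce`, `text_emb`, `payload_unaligned`, and `payload_aligned`, the temperature value, and the two-dimensional toy vectors are all hypothetical. The sketch only shows that the loss is lower when each payload embedding lies close to the description embedding of its own class.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE-style loss (assumed objective, not confirmed by the source).

    anchors[i] is expected to match targets[i]; every other target
    in the batch acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidate targets.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Hypothetical frozen stage-1 description embeddings, one per threat class.
text_emb = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical payload embeddings: one set pointing at the wrong class,
# one set close to the matching description embedding.
payload_unaligned = [[0.0, 1.0], [1.0, 0.0]]
payload_aligned = [[0.9, 0.1], [0.1, 0.9]]

loss_unaligned = info_nce(payload_unaligned, text_emb)
loss_aligned = info_nce(payload_aligned, text_emb)
print(loss_aligned < loss_unaligned)
```

In a trained system the payload embeddings would come from a learned encoder updated to minimize such a loss, while the description embeddings from the first stage stay fixed; both of these details are assumptions made for the sketch.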