In the current cybersecurity landscape, protecting military devices such as communication and battlefield management systems against sophisticated cyber attacks is crucial. Malware exploits vulnerabilities through stealthy methods, often evading traditional detection mechanisms such as software signatures. The application of ML/DL to vulnerability detection has been extensively explored in the literature. However, current ML/DL vulnerability detection methods struggle to understand the context and intent behind complex attacks. Integrating large language models (LLMs) with system call analysis offers a promising approach to enhancing malware detection. This work presents a novel framework that leverages LLMs to classify malware based on system call data. The framework uses transfer learning to adapt pre-trained LLMs for malware detection. By retraining LLMs on a dataset of benign and malicious system calls, the models are refined to detect signs of malware activity. Experiments with a dataset of over 1 TB of system calls demonstrate that models with larger context sizes, such as BigBird and Longformer, achieve superior performance, reaching accuracy and F1-score of approximately 0.86. The results highlight the importance of context size in improving detection rates and underscore the trade-offs between computational complexity and performance. This approach shows significant potential for real-time detection in high-stakes environments, offering a robust solution to evolving cyber threats.
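To make the role of context size concrete, the following is a minimal, illustrative sketch of how a system-call trace might be tokenized and split into fixed-size windows before being fed to a classifier. The vocabulary construction, window sizes, and trace contents are hypothetical assumptions for illustration, not details from the paper; the point is that a short-context model must fragment a long trace, while a long-context model such as Longformer or BigBird can ingest it whole.

```python
# Illustrative sketch: encoding a system-call trace into fixed-size token
# windows, as a long-context transformer would consume it. The vocabulary,
# window sizes, and toy trace below are hypothetical, not from the paper.

def build_vocab(traces):
    """Map each distinct system call name to an integer token id."""
    vocab = {}
    for trace in traces:
        for call in trace:
            vocab.setdefault(call, len(vocab))
    return vocab

def encode(trace, vocab, context_size):
    """Tokenize a trace and split it into windows of at most context_size."""
    ids = [vocab[c] for c in trace]
    return [ids[i:i + context_size] for i in range(0, len(ids), context_size)]

# A toy trace of 1000 calls: a short-context model fragments it,
# a long-context model keeps it in a single window.
trace = ["open", "read", "mmap", "write", "close"] * 200
vocab = build_vocab([trace])

short_windows = encode(trace, vocab, context_size=512)   # BERT-like limit
long_windows = encode(trace, vocab, context_size=4096)   # Longformer/BigBird

print(len(short_windows))  # → 2: the trace is split across windows
print(len(long_windows))   # → 1: the whole trace fits in one context
```

Fragmenting a trace across windows can separate the early and late stages of an attack, which is one intuition behind the reported advantage of larger context sizes.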