Malicious software (malware) is one of the pivotal security threats to embedded computing systems. Machine Learning (ML) has been widely adopted for malware detection in recent years because of its efficiency and efficacy. However, existing techniques require a tremendous number of benign and malware samples to train an effective malware detector. This requirement limits the detection of emerging malware, for which too few samples are available for effective training. To address these concerns, we introduce a code-aware data generation technique that generates multiple mutated samples of malware that devices have seen only in limited numbers. Loss minimization ensures that the generated samples closely mimic the limitedly seen malware and mitigates impractical samples. The generated samples are then incorporated into the training set to build a model that can efficiently detect emerging malware despite limited exposure. The experimental results demonstrate that the proposed technique achieves an accuracy of 90% in detecting limitedly seen malware, which is approximately 3x the accuracy attained by state-of-the-art techniques.