Scalable Malware Family Classification Using Quantum Kernel Based Machine Learning

Malware family classification is a key challenge in cybersecurity because it supports threat attribution, analysis of attack operations, and the formulation of effective defense strategies. Emerging malware samples are increasingly obfuscated and structurally similar to one another, which makes accurate multiclass classification difficult for traditional machine learning models, especially when deployed at scale. In this research, we propose a scalable Quantum Kernel-based Machine Learning (QKML) framework for malware family classification that addresses both accuracy and efficiency constraints. The framework extracts structural features from executable files and applies a supervised Linear Discriminant Analysis (LDA) projection to produce a compact, class-aware representation well suited to quantum processing. A fidelity-based quantum kernel built from parameterized quantum circuits captures the nonlinear relationships among malware families. We use the Nyström method to obtain a low-rank approximation of the quantum kernel, which allows multiclass classification via ridge regression and learning from all available training samples without incurring the quadratic computational cost of constructing the full kernel matrix. In an experimental evaluation on a large-scale malware dataset of 18,836 samples across 23 malware families, the proposed model achieves 80.88% accuracy and outperforms classical machine learning baselines under identical features and data splits. These findings suggest that scalable quantum-kernel-based machine learning can offer measurable performance advantages for real-world malware family classification tasks.
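
The kernel, Nyström, and ridge-regression stages described above can be illustrated with a minimal sketch. It is not the framework's implementation: the abstract does not specify the circuit, so the sketch assumes a product-state feature map (one RY rotation per qubit, no entanglement), for which the fidelity kernel has a closed form that can be evaluated classically. The function names, the synthetic three-class data, the number of landmarks, and the regularization strength are all hypothetical, and the LDA projection step is omitted.

```python
import numpy as np


def fidelity_kernel(X, Y):
    """Fidelity kernel |<phi(x)|phi(y)>|^2 for an assumed product feature map.

    Assumption: feature x_j is encoded on its own qubit as
    cos(x_j/2)|0> + sin(x_j/2)|1>, with no entangling gates, so the state
    overlap factorizes and the fidelity is prod_j cos^2((x_j - y_j)/2).
    """
    diff = X[:, None, :] - Y[None, :, :]  # shape (n, m, d)
    return np.prod(np.cos(diff / 2.0) ** 2, axis=2)


def nystrom_features(X, landmarks, eps=1e-8):
    """Return features Z such that Z @ Z.T approximates the full kernel matrix.

    Only the n x m and m x m kernel blocks are evaluated, never n x n.
    """
    K_mm = fidelity_kernel(landmarks, landmarks)
    K_nm = fidelity_kernel(X, landmarks)
    w, V = np.linalg.eigh(K_mm)
    keep = w > eps  # drop numerically null directions
    return K_nm @ (V[:, keep] / np.sqrt(w[keep]))


def fit_ridge(Z, y, n_classes, lam=1e-3):
    """Multiclass ridge regression on one-hot targets (closed form)."""
    Y = np.eye(n_classes)[y]
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ Y)


def predict(Z, W):
    """Predict the class with the largest regression score."""
    return np.argmax(Z @ W, axis=1)


# Synthetic stand-in for LDA-projected features (hypothetical data).
rng = np.random.default_rng(0)
centers = np.array([[0.4, 0.4, 0.4, 0.4],
                    [2.0, 2.0, 0.4, 0.4],
                    [0.4, 0.4, 2.0, 2.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + 0.2 * rng.normal(size=(300, 4))
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

# Landmarks are a random subset of the training samples.
idx = rng.choice(240, size=30, replace=False)
landmarks = X_train[idx]

Z_train = nystrom_features(X_train, landmarks)
Z_test = nystrom_features(X_test, landmarks)
W = fit_ridge(Z_train, y_train, n_classes=3)
accuracy = np.mean(predict(Z_test, W) == y_test)
print(f"toy test accuracy: {accuracy:.3f}")
```

With n training samples and m landmarks, the sketch evaluates on the order of n·m kernel entries instead of n², which is the source of the scalability claimed for the Nyström step. On quantum hardware or a circuit simulator, each call to the hypothetical `fidelity_kernel` would be replaced by circuit-based fidelity estimates.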
