Artificial intelligence and machine learning have significantly advanced malware research by enabling automated threat detection and behavior analysis. However, exploitable data remains scarce because large datasets of real-world data are lacking, and this scarcity limits model generalization. To address this difficulty, this work investigates TabPFN, a learning-free model designed for low-data regimes. We evaluate its performance against established baselines such as Random Forest, LightGBM, and XGBoost across multiple class configurations. Our experimental results indicate that TabPFN surpasses all other models in low-data regimes, with a 2% to 6% improvement observed across multiple performance metrics. However, this gain comes at a cost in computation time in one particular case. These findings highlight both the promise and the practical limitations of integrating TabPFN into cybersecurity workflows.