Effective and efficient malware detection is at the forefront of research into building secure digital systems. As in many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has been widely used for pattern matching in general, and for malware detection in particular, is the hidden Markov model (HMM). HMM training is based on a hill climb, which is only guaranteed to reach a local optimum that depends on the starting point, and hence we can often improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets. We find that random restarts perform surprisingly well in comparison to boosting. Only in the most difficult "cold start" cases (where training data is severely limited) does boosting appear to offer sufficient improvement to justify its higher computational cost in the scoring phase.
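To make the random-restart idea concrete, the following is a minimal sketch in pure Python, not the implementation used in this research. It trains a discrete HMM with scaled Baum-Welch re-estimation from several random starting points and keeps the model with the highest training log-likelihood. The selection criterion, the function names (`train_once`, `train_with_restarts`), the iteration and restart counts, and the toy observation sequence are all assumptions made for illustration.

```python
import math
import random

def random_row(n, rng):
    # Near-uniform positive weights, normalized so the row sums to 1.
    w = [1.0 + 0.2 * rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def forward_backward(obs, pi, A, B):
    # Scaled forward and backward passes; returns (alpha, beta, log P(obs | model)).
    N, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(N)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                   for j in range(N)]
        c = 1.0 / sum(row)
        alpha.append([x * c for x in row])
        scale.append(c)
    beta = [[scale[T - 1]] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [scale[t] * sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                                  for j in range(N)) for i in range(N)]
    return alpha, beta, -sum(math.log(c) for c in scale)

def reestimate(obs, pi, A, B):
    # One Baum-Welch (EM) update of pi, A, and B.
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta, _ = forward_backward(obs, pi, A, B)
    gamma = [[0.0] * N for _ in range(T)]
    xi = [[0.0] * N for _ in range(N)]
    for t in range(T - 1):
        for i in range(N):
            for j in range(N):
                x = alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                xi[i][j] += x
                gamma[t][i] += x
    gamma[T - 1] = list(alpha[T - 1])
    new_A = [[xi[i][j] / sum(xi[i]) for j in range(N)] for i in range(N)]
    new_B = []
    for j in range(N):
        denom = sum(gamma[t][j] for t in range(T))
        new_B.append([sum(gamma[t][j] for t in range(T) if obs[t] == k) / denom
                      for k in range(M)])
    return list(gamma[0]), new_A, new_B

def train_once(obs, N, M, seed, iters=30):
    # Hill climb from one random starting point determined by `seed`.
    rng = random.Random(seed)
    pi = random_row(N, rng)
    A = [random_row(N, rng) for _ in range(N)]
    B = [random_row(M, rng) for _ in range(N)]
    for _ in range(iters):
        pi, A, B = reestimate(obs, pi, A, B)
    return forward_backward(obs, pi, A, B)[2], (pi, A, B)

def train_with_restarts(obs, N, M, restarts=10):
    # Assumption: keep the restart with the highest training log-likelihood.
    return max((train_once(obs, N, M, seed) for seed in range(restarts)),
               key=lambda result: result[0])

# Toy integer-coded observation sequence (hypothetical data, 3 symbols).
obs = [0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 2, 2, 2, 1, 2, 2, 0, 1, 0, 1] * 3
best_loglik, (pi, A, B) = train_with_restarts(obs, N=2, M=3)
print(best_loglik)
```

Note that with restarts only one model is retained, so scoring a sample costs a single forward pass. A boosted ensemble must instead evaluate every constituent HMM, which is the higher scoring-phase cost referred to above.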