Machine learning is widely used for malicious website detection, but to the best of our knowledge, no approach that uses WebAssembly as a feature has been explored, because too few samples are available. In this paper, we propose JABBERWOCK (JAvascript-Based Binary EncodeR by WebAssembly Optimization paCKer), a tool that generates pseudo WebAssembly datasets from JavaScript. Loosely speaking, JABBERWOCK automatically gathers real-world JavaScript code, converts it into WebAssembly, and then outputs vectors of the WebAssembly as samples for malicious website detection. We evaluate JABBERWOCK experimentally in terms of the processing time for dataset generation, a comparison of the generated samples with actual WebAssembly samples gathered from the Internet, and an application to malicious website detection. Regarding processing time, we show that JABBERWOCK constructs a dataset in 4.5 seconds per sample, regardless of the number of samples. Next, comparing 10,000 samples output by JABBERWOCK with 168 gathered WebAssembly samples, we believe that the samples generated by JABBERWOCK are similar to those in the real world. We then show that JABBERWOCK enables malicious website detection with a 99% F1-score; the reason for this high score is that JABBERWOCK produces a gap between benign and malicious samples. We also confirm that JABBERWOCK can be combined with an existing malicious website detection tool to improve F1-scores. JABBERWOCK is publicly available on GitHub (https://github.com/c-chocolate/Jabberwock).
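The last step of the pipeline described above, turning a WebAssembly binary into a vector, can be illustrated with a minimal sketch. The byte-histogram encoding and the helper name wasm_to_vector are assumptions introduced here purely for illustration; they are not taken from JABBERWOCK, whose actual vectorization is defined in the linked repository.

```python
from collections import Counter

# Every WebAssembly binary module starts with these four bytes ("\0asm").
WASM_MAGIC = b"\x00asm"


def wasm_to_vector(module_bytes: bytes) -> list[float]:
    """Hypothetical helper: normalized 256-bin byte histogram of a .wasm binary.

    This is an illustrative feature encoding, not JABBERWOCK's own method.
    """
    if module_bytes[:4] != WASM_MAGIC:
        raise ValueError("not a WebAssembly binary module")
    counts = Counter(module_bytes)  # iterating bytes yields ints in 0..255
    total = len(module_bytes)
    return [counts.get(value, 0) / total for value in range(256)]


# Smallest valid module: the magic number followed by version 1 (little-endian).
empty_module = WASM_MAGIC + b"\x01\x00\x00\x00"
vector = wasm_to_vector(empty_module)
```

A fixed-length vector of this kind could be fed to any standard classifier. A histogram discards instruction order, so it should be read as a baseline illustration only.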