Compared with binaries and decompiled code, malware source code reflects the attackers' original intent more directly. However, source code is scarce and manual review is costly, which makes such datasets difficult to build and maintain. We propose MASCOT-Android, a curated dataset of Android malware source code, together with an automated collection framework for scalable discovery of malware source code on GitHub. A key finding of our work is that repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories from benign ones. This README-only model achieves an accuracy of 96.28% and a false positive rate (FPR) of 1.06% in local evaluation. The model also outputs confidence scores, so users can adjust the decision threshold to trade off FPR against coverage, which is useful in real-world malware source code collection.
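The threshold adjustment described in the abstract can be illustrated with a minimal sketch. It assumes the confidence scores are signed margins of the kind scikit-learn's `LinearSVC.decision_function` returns, and it interprets "coverage" as the fraction of true malware repositories that get flagged; both are assumptions, not details confirmed by the abstract. The function names `rates_at_threshold` and `pick_threshold` and the toy scores are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def rates_at_threshold(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Return (fpr, coverage) when repos with score >= threshold are flagged.

    labels: 1 = malware, 0 = benign.
    Coverage here means the share of malware repos that are flagged
    (an assumed reading of the term).
    """
    flagged = [s >= threshold for s in scores]
    benign_total = sum(1 for y in labels if y == 0)
    malware_total = sum(1 for y in labels if y == 1)
    false_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fpr = false_pos / benign_total if benign_total else 0.0
    coverage = true_pos / malware_total if malware_total else 0.0
    return fpr, coverage


def pick_threshold(
    scores: List[float], labels: List[int], max_fpr: float
) -> Tuple[float, float, float]:
    """Return (threshold, fpr, coverage) for the lowest observed score
    whose FPR stays within max_fpr.

    FPR never increases as the threshold rises, so the first candidate
    that satisfies the budget also gives the highest coverage.
    """
    for t in sorted(set(scores)):
        fpr, coverage = rates_at_threshold(scores, labels, t)
        if fpr <= max_fpr:
            return t, fpr, coverage
    # No observed score is strict enough: flag nothing.
    return float("inf"), 0.0, 0.0


# Toy validation scores, invented purely for illustration.
toy_scores = [-1.2, -0.4, 0.2, 0.1, 0.9, 1.5]
toy_labels = [0, 0, 0, 1, 1, 1]

default_fpr, default_coverage = rates_at_threshold(toy_scores, toy_labels, 0.0)
strict_threshold, strict_fpr, strict_coverage = pick_threshold(
    toy_scores, toy_labels, max_fpr=0.0
)
```

On this toy data, the default threshold of 0 flags every malware repository but also one of the three benign ones, while a zero-FPR budget raises the threshold and gives up one malware repository. A collector that feeds manual review might prefer the stricter setting, whereas one that feeds a later automated filter might accept more false positives in exchange for coverage.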