Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks such as malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure structure of pseudocode and the lack of malware summaries for training. Furthermore, the calling relationships between functions, which carry the rich interactions within a malware binary, remain largely underexplored. To this end, we propose MALSIGHT, a novel code summarization framework that iteratively generates descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summary datasets, MalS and MalP, using an LLM and then refine them manually. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS and benign pseudocode datasets. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. This procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby improving the usability, accuracy, and completeness of the summaries. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed MALSIGHT. Notably, our proposed MalT5, with only 0.77B parameters, delivers performance comparable to that of the much larger Code-Llama.
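To make the test-stage procedure concrete, the following is a minimal sketch of iterative, call-graph-aware summarization. It rests on assumptions the abstract does not state: that functions are visited callees first, and that each callee's summary is passed along as context when its caller is summarized. The names `summarize_binary`, `stub_summarize`, and the toy functions are hypothetical, and `stub_summarize` is only a stand-in for a model such as MalT5, not its actual interface.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def summarize_binary(functions, calls, summarize):
    """Summarize every function, visiting callees before their callers.

    functions: dict mapping function name -> pseudocode text
    calls:     dict mapping function name -> set of callee names
    summarize: callable(pseudocode, callee_notes) -> summary string
               (hypothetical stand-in for the summarization model)
    """
    # TopologicalSorter takes node -> predecessors. Treating callees as
    # predecessors makes static_order() yield callees before callers.
    graph = {
        name: {c for c in calls.get(name, ()) if c in functions}
        for name in functions
    }
    summaries = {}
    # Note: recursive call cycles raise graphlib.CycleError in this sketch.
    for name in TopologicalSorter(graph).static_order():
        callee_notes = [f"{c}: {summaries[c]}" for c in sorted(graph[name])]
        summaries[name] = summarize(functions[name], callee_notes)
    return summaries


def stub_summarize(pseudocode, callee_notes):
    """Placeholder summarizer: reports sizes instead of calling a model."""
    return f"{len(pseudocode)} chars of pseudocode, {len(callee_notes)} callee summaries"


# Toy input with made-up function names, for illustration only.
demo_functions = {
    "main": "int main() { connect_c2(); }",
    "connect_c2": "void connect_c2() { send_data(); }",
    "send_data": "void send_data() {}",
}
demo_calls = {"main": {"connect_c2"}, "connect_c2": {"send_data"}}
demo_summaries = summarize_binary(demo_functions, demo_calls, stub_summarize)
```

In this sketch, `send_data` is summarized first and `main` last, so each caller sees the summaries of the functions it invokes. A real pipeline would also need a policy for recursive call cycles, which this version simply rejects.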