In this work, we begin to investigate the possibility of training a deep neural network on the task of binary code understanding. Specifically, the network would take as input features derived directly from binaries and output English descriptions of functionality, to aid a reverse engineer in investigating the capabilities of a piece of closed-source software, be it malicious or benign. Given the recent success in applying large language models (generative AI) to the task of source code summarization, this seems a promising direction. However, in our initial survey of the available datasets, we found nothing of sufficiently high quality and volume to train these complex models. Instead, we build our own dataset derived from a capture of Stack Overflow containing 1.1M entries. A major result of our work is a novel dataset evaluation method that uses the correlation between two distances computed on sample pairs: one distance in the embedding space of the inputs and the other in the embedding space of the outputs. Intuitively, if two samples have inputs that are close in the input embedding space, their outputs should also be close in the output embedding space. We found this Embedding Distance Correlation (EDC) test to be highly diagnostic: it indicates that our collected dataset and several existing open-source datasets are of low quality, because their distances are not well correlated. We then explored the general applicability of EDC by applying it to a number of datasets known qualitatively to be good and a number of datasets synthetically constructed to be bad, and found it to be a reliable indicator of dataset value.
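The EDC idea described above can be illustrated with a minimal sketch. The abstract does not specify the distance metric, the correlation coefficient, or how sample pairs are chosen, so the choices below (Euclidean distance, Pearson correlation, uniformly random pairs) and the function name `embedding_distance_correlation` are assumptions made purely for illustration, not the authors' implementation.

```python
import numpy as np


def embedding_distance_correlation(input_emb, output_emb, n_pairs=1000, seed=0):
    """Correlate input-space and output-space distances over random sample pairs.

    Assumptions (not taken from the paper): Euclidean distance,
    Pearson correlation, and uniformly sampled pairs.
    """
    input_emb = np.asarray(input_emb, dtype=float)
    output_emb = np.asarray(output_emb, dtype=float)
    if input_emb.shape[0] != output_emb.shape[0]:
        raise ValueError("inputs and outputs must contain the same number of samples")

    n = input_emb.shape[0]
    rng = np.random.default_rng(seed)

    # Draw random index pairs and discard pairs of a sample with itself.
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j
    i, j = i[keep], j[keep]

    # One distance per pair in each embedding space.
    d_in = np.linalg.norm(input_emb[i] - input_emb[j], axis=1)
    d_out = np.linalg.norm(output_emb[i] - output_emb[j], axis=1)

    # Pearson correlation between the two distance vectors.
    return float(np.corrcoef(d_in, d_out)[0, 1])
```

Under these assumptions, a dataset whose outputs track its inputs should score close to 1, while a dataset whose outputs have been shuffled relative to its inputs (one simple way to construct a synthetically bad dataset) should score close to 0.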