Binary code representation learning has shown significant performance in binary analysis tasks. However, existing solutions often transfer poorly, particularly in few-shot and zero-shot scenarios where few or no training samples are available for the task. To address this problem, we present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code (i.e., assembly code) and achieve better transferability. At its core, our approach gains superior transfer learning capability by effectively aligning binary code with its semantic explanations (in natural language), resulting in a model that generates better embeddings for binary code. To enable this alignment training, we propose an efficient dataset engine that automatically generates a large and diverse dataset comprising binary code and corresponding natural language explanations. We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP. Evaluations across various downstream binary analysis tasks all demonstrate exceptional performance. Notably, without any task-specific training, CLAP is often competitive with a fully supervised baseline, showing excellent transferability. We release our pre-trained model and code at https://github.com/Hustcw/CLAP.
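The alignment described above is in the spirit of CLIP-style contrastive pre-training: matched (binary code, explanation) embedding pairs are pulled together while mismatched pairs in the same batch are pushed apart. The sketch below is a minimal, hypothetical illustration of such a symmetric InfoNCE objective in NumPy; it is not the authors' implementation, and the function name and temperature value are assumptions for illustration only.

```python
import numpy as np

def contrastive_alignment_loss(code_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    code_emb, text_emb: (N, D) arrays; row i of each matrix is a matched
    (assembly, explanation) pair. Illustrative sketch only -- not CLAP's
    actual training code.
    """
    # L2-normalize so the dot product is cosine similarity.
    code = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = code @ text.T / temperature  # (N, N); matched pairs on the diagonal
    labels = np.arange(logits.shape[0])

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the code->text and text->code directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Under this objective, a batch whose code and text embeddings agree row-by-row yields a lower loss than one where the pairing is scrambled, which is exactly the signal that aligns the two modalities.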