Code reuse is crucial in software development because it shortens and streamlines the development lifecycle. In practice, however, code reuse is often poorly controlled, which leads to issues such as vulnerability propagation and intellectual property infringement. Assembly clone search, a critical shift-right defence mechanism, has been effective in identifying vulnerable code that results from reuse in released executables. Recent studies on assembly clone search show a trend towards machine learning-based methods that match assembly code variants produced by different toolchains. However, these methods are limited to what they learn from the small number of toolchain variants used in training, which renders them inapplicable to unseen architectures and their corresponding compilation toolchain variants. This paper presents the first study on the problem of assembly clone search with unseen architectures and libraries. We propose incorporating human common knowledge into current learning-based approaches for assembly clone search through transfer learning from large-scale pre-trained natural language models. Transfer learning can help address the limitations of existing approaches because it brings in broader knowledge from human experts in assembly code. We further address the sequence length limit by proposing a reinforcement learning agent that removes unnecessary and redundant tokens. Coupled with a new Variational Information Bottleneck learning strategy, the proposed system minimizes its reliance on potential indicators of architectures and optimization settings, for better generalization to unseen architectures. We simulate unseen-architecture clone search scenarios, and the experimental results show the effectiveness of the proposed approach against state-of-the-art solutions.