Neural-FEBI: Accurate Function Identification in Ethereum Virtual Machine Bytecode

Millions of smart contracts have been deployed on the Ethereum platform, making them potential attack targets. Because their source code is unavailable, analyzing contract binaries is vital. A key step in that analysis is function identification, which comprises identifying function entries and detecting function boundaries. Such boundaries are critical to many smart contract applications, e.g., reverse engineering and profiling. Unfortunately, identifying functions in these stripped contract binaries is challenging because they lack internal function call statements and because the compiler reshuffles instructions. Existing works rely heavily on sets of handcrafted heuristic rules, which introduce several shortcomings. To address this issue, we propose a novel neural network-based framework for EVM bytecode Function Entries and Boundaries Identification (neural-FEBI) that does not rely on a fixed set of handcrafted rules. Instead, it uses a two-level bidirectional Long Short-Term Memory (bi-LSTM) network and a Conditional Random Field (CRF) network to locate function entries. The framework also includes a control flow traversal algorithm that determines the code segments reachable from a function entry and treats them as that function's boundary. Experiments on 38,996 publicly available smart contract binaries show that neural-FEBI achieves F1-scores on the function entry identification task ranging from a lowest value of 88.3% to a highest value of 99.7% across different datasets. Its performance on the function boundary identification task also increases from 79.4% to 97.1% compared with the state of the art. We further demonstrate that the identified function information can be used to construct more accurate intra-procedural control flow graphs (CFGs) and call graphs. The experimental results confirm that the proposed framework significantly outperforms state-of-the-art approaches, which are often based on handcrafted heuristic rules.
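
To make the boundary step more concrete, the following is a minimal Python sketch of a control flow traversal that collects the basic blocks reachable from a function entry. It is an illustration under simplifying assumptions, not the algorithm used by neural-FEBI: it assumes the bytecode has already been split into basic blocks, that jump targets are already resolved into a successor map, and that edges leading into another function's entry are calls that should not be followed. The names `function_boundary`, `successors`, and `entries` are hypothetical.

```python
from collections import deque


def function_boundary(entry, successors, entries):
    """Return the set of basic blocks reachable from `entry`.

    Assumptions (hypothetical, for illustration only):
      - `successors` maps a block id to the ids of its successor blocks.
      - `entries` is the set of block ids identified as function entries.
      - An edge into a different function's entry is treated as a call
        and is not followed.
    """
    seen = {entry}
    queue = deque([entry])
    while queue:
        block = queue.popleft()
        for nxt in successors.get(block, ()):
            if nxt in entries and nxt != entry:
                continue  # edge into another function: do not cross it
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy control flow graph with two functions whose entries are blocks 0 and 10.
successors = {0: [1, 2], 1: [3], 2: [3], 3: [10], 10: [11], 11: []}
entries = {0, 10}

boundary_of_first = function_boundary(0, successors, entries)
boundary_of_second = function_boundary(10, successors, entries)
```

In this toy graph, the traversal from block 0 stops at the edge into block 10, so the two functions receive disjoint block sets. Real EVM bytecode is harder to handle, since jump targets are pushed on the stack and return addresses must be tracked, which this sketch deliberately leaves out.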
