Neural-FEBI: Accurate Function Identification in Ethereum Virtual Machine Bytecode

Millions of smart contracts have been deployed on the Ethereum platform, making them potential targets for attack. Since their source code is unavailable, analyzing contract binaries is vital. A key step in this analysis is function identification, which comprises identifying function entries and detecting function boundaries. These boundaries are critical to many smart contract applications, e.g., reverse engineering and profiling. Unfortunately, identifying functions in stripped contract binaries is challenging because they lack internal function call statements and because the compiler reshuffles instructions. Existing works rely excessively on sets of handcrafted heuristic rules, which lead to identification errors. To address this issue, we propose a novel neural network-based framework for EVM bytecode Function Entries and Boundaries Identification (neural-FEBI) that does not rely on a fixed set of handcrafted rules. Instead, it uses a two-level bidirectional Long Short-Term Memory (bi-LSTM) network and a Conditional Random Field (CRF) network to locate function entries. The framework also includes a control flow traversal algorithm that determines the code segments reachable from a function entry and takes them as that function's boundary. Experiments on 38,996 publicly available smart contracts collected as binaries show that neural-FEBI achieves F1-scores ranging from 88.3 (lowest) to 99.7 (highest) on the function entry identification task across different datasets. On the function boundary identification task, it raises performance from 79.4% to 97.1% compared with the state of the art. We further demonstrate that the identified function information can be used to construct more accurate intra-procedural control flow graphs (CFGs) and call graphs. The experimental results confirm that the proposed framework significantly outperforms state-of-the-art approaches, which are often based on handcrafted heuristic rules.
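To illustrate the idea of taking the code reachable from a function entry as its boundary, the following is a minimal sketch and not the traversal algorithm of neural-FEBI itself. It assumes a simplified, hypothetical representation in which the control flow graph is a dictionary from basic-block offsets to successor offsets, and it assumes that the traversal stops whenever it reaches the entry of another function. The names `function_boundary`, `cfg`, and `other_entries` are illustrative only.

```python
from collections import deque


def function_boundary(cfg, entry, other_entries):
    """Collect the basic blocks reachable from `entry`.

    cfg: dict mapping a block offset to a list of successor offsets
         (a hypothetical, simplified CFG; real EVM control flow is
         harder to recover because jump targets are computed on the stack).
    other_entries: offsets of other function entries; the traversal
         does not descend into them (an assumption of this sketch).
    """
    seen = {entry}
    queue = deque([entry])
    while queue:
        block = queue.popleft()
        for succ in cfg.get(block, []):
            # Skip blocks already visited and blocks owned by other functions.
            if succ in seen or succ in other_entries:
                continue
            seen.add(succ)
            queue.append(succ)
    return seen


# Toy CFG with two functions whose entries are at 0x10 and 0x80.
cfg = {
    0x10: [0x20, 0x30],
    0x20: [0x40],
    0x30: [0x40],
    0x40: [0x80],  # edge into the other function's entry
    0x80: [0x90],
    0x90: [],
}
entries = {0x10, 0x80}
body = function_boundary(cfg, 0x10, entries - {0x10})
print(sorted(hex(b) for b in body))
```

In this toy example the block at 0x80 is excluded from the first function because it is the entry of another function. The quality of the recovered boundary therefore depends directly on how accurately the entries were identified beforehand.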
