Blockchain, as a distributed ledger technology, has become increasingly popular, especially for enabling valuable cryptocurrencies and smart contracts. However, blockchain software systems inevitably contain many bugs. Although bugs in smart contracts have been extensively investigated, security bugs in the underlying blockchain systems are much less explored. In this paper, we conduct an empirical study of system vulnerabilities in four representative blockchains: Bitcoin, Ethereum, Monero, and Stellar. Specifically, we first design a systematic filtering process that identifies 1,037 vulnerabilities and their 2,317 patches from 34,245 issues/PRs (pull requests) and 85,164 commits on GitHub, and thereby build the first blockchain vulnerability dataset. We then analyze this dataset at three levels: (i) file-level vulnerable module categorization, by identifying and correlating module paths across projects; (ii) text-level vulnerability type clustering, by natural language processing and similarity-based sentence clustering; and (iii) code-level vulnerability pattern analysis, by generating and clustering code change signatures that capture both syntactic and semantic information of patch code fragments. Our analyses reveal three key findings: (i) some blockchain modules are more susceptible than others; notably, the modules related to consensus, wallet, and networking each have over 200 issues; (ii) about 70% of blockchain vulnerabilities are of traditional types, but we also identify four new types specific to blockchains; and (iii) we obtain 21 blockchain-specific vulnerability patterns that capture unique blockchain attributes and statuses, and we demonstrate that they can be used to detect similar vulnerabilities in other popular blockchains, such as Dogecoin, Bitcoin SV, and Zcash.
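To make the text-level step more concrete, the following is a minimal sketch of similarity-based sentence clustering. It is not the pipeline used in this study: the abstract does not specify the similarity measure or clustering algorithm, so token-set Jaccard similarity and a greedy single-pass grouping are assumed here as stand-ins. The function names, the threshold value, and the example issue titles are all hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_sentences(sentences, threshold=0.5):
    """Greedy single-pass clustering.

    Each sentence joins the existing cluster whose first member
    (the representative) is most similar to it, provided the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of (sentence, token_set)
    for sentence in sentences:
        tokens = tokenize(sentence)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(tokens, cluster[0][1])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([(sentence, tokens)])
        else:
            best.append((sentence, tokens))
    return [[s for s, _ in cluster] for cluster in clusters]


# Hypothetical issue titles, invented for illustration only.
titles = [
    "Fix integer overflow in fee calculation",
    "Fix integer overflow in amount calculation",
    "Prevent null pointer dereference in wallet",
    "Avoid null pointer dereference in wallet rpc",
]

for group in cluster_sentences(titles):
    print(group)
```

With the hypothetical titles above, the two overflow titles fall into one group and the two null-pointer titles into another. A real analysis would likely replace the token-set measure with a semantic sentence representation, since titles describing the same vulnerability type often share few exact words.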