Context: Smart contracts are exposed to numerous security threats arising from undisclosed vulnerabilities and code weaknesses. In Ethereum smart contracts, the difficulty of addressing these weaknesses in a timely manner highlights the need for automated early prediction and prioritization during code review. Efficient prioritization is therefore crucial for smart contract security. Objective: We present PrAIoritize, an automated approach for predicting and prioritizing critical code weaknesses in Ethereum smart contracts during the code review process. Method: We collected smart contract code reviews from open-source software (OSS) projects on GitHub and from the Common Vulnerabilities and Exposures (CVE) database. PrAIoritize combines Large Language Models (LLMs) with natural language processing (NLP) techniques. It first labels code reviews automatically using a domain-specific lexicon of smart contract weaknesses and their impacts. It then performs feature engineering on the code reviews and uses a pre-trained DistilBERT model for priority classification. Finally, the model is trained and evaluated on the smart contract code reviews. Results: In our evaluation, PrAIoritize outperforms state-of-the-art baselines and commonly used pre-trained models (e.g., T5) for similar classification tasks, with increases of 4.82% to 27.94% in F-measure, precision, and recall. Conclusion: With PrAIoritize, practitioners can prioritize smart contract code weaknesses efficiently, address critical weaknesses promptly, and reduce the time and effort required for manual triage.
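The lexicon-based labeling step described in the Method can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon entries, the three priority levels, the `label_review` function, and the `unlabeled` fallback are all hypothetical, chosen only to show how a domain-specific lexicon of weaknesses and impacts might assign a priority label to a code review comment.

```python
import re

# Hypothetical lexicon mapping weakness/impact terms to a priority level.
# These entries are illustrative assumptions, not taken from the paper.
LEXICON = {
    "high": ["reentrancy", "integer overflow", "loss of funds", "unauthorized access"],
    "medium": ["denial of service", "unchecked return value", "front-running"],
    "low": ["gas optimization", "naming convention", "typo"],
}

# Priorities are checked from most to least severe, so the most severe match wins.
PRIORITY_ORDER = ["high", "medium", "low"]


def label_review(text: str, default: str = "unlabeled") -> str:
    """Return the most severe priority whose lexicon terms occur in the review text."""
    # Lowercase and collapse whitespace so multi-word terms match across line breaks.
    normalized = re.sub(r"\s+", " ", text.lower())
    for priority in PRIORITY_ORDER:
        for term in LEXICON[priority]:
            # Word boundaries prevent partial matches such as 'typo' in 'typography'.
            if re.search(r"\b" + re.escape(term) + r"\b", normalized):
                return priority
    return default
```

Under these assumptions, a comment that mentions both a medium-priority and a low-priority term receives the medium label, since the more severe match takes precedence. Labels produced this way could then serve as training targets for a downstream classifier.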