Static analysis tools have gained popularity among developers for finding potential bugs, but their widespread adoption is hindered by their high false alarm rates (up to 90%). To address this challenge, previous studies proposed the concept of actionable warnings and applied machine-learning methods to distinguish actionable warnings from false alarms. Despite these efforts, our preliminary study suggests that the current methods used to collect actionable warnings are unreliable, resulting in a large proportion of invalid actionable warnings. In this work, we mined 68,274 reversions from the Top-500 GitHub C repositories to create a substantial actionable warning dataset and assigned each warning a weak label indicating its likelihood of being a real bug. To automatically identify actionable warnings and recommend those with a high probability of being real bugs (AWHB), we propose a two-stage framework called ACWRecommender. In the first stage, our tool uses a pre-trained model, UniXcoder, to identify actionable warnings among the large number of warnings reported by static analysis (SA) tools. In the second stage, we use weakly supervised learning to rerank valid actionable warnings to the top. Experimental results show that our tool outperforms several baselines for actionable warning detection (in terms of F1-score) and performs better for AWHB recommendation (in terms of nDCG and MRR). Additionally, in an in-the-wild evaluation, we manually validated 24 warnings out of 2,197 reported warnings on 10 randomly selected projects; 22 of them were confirmed by developers as real bugs, demonstrating the practical usefulness of our tool.
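The abstract names nDCG and MRR as the ranking metrics for AWHB recommendation but does not define them. As a rough illustration, the sketch below computes the textbook forms of both metrics over ranked lists of relevance labels. The function names, the use of linear gain in DCG, and the example labels are assumptions made for illustration; the exact variants and cutoffs used in the evaluation are not stated here.

```python
import math


def reciprocal_rank(labels):
    """Return 1/rank of the first relevant item (label > 0), or 0.0 if none."""
    for rank, label in enumerate(labels, start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average the reciprocal rank over several ranked lists (e.g. one per project)."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


def dcg(labels):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(label / math.log2(rank + 1)
               for rank, label in enumerate(labels, start=1))


def ndcg(labels):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical ranked warnings: 1 = likely real bug, 0 = not.
    ranked = [0, 1, 1, 0]
    print(f"RR   = {reciprocal_rank(ranked):.3f}")
    print(f"nDCG = {ndcg(ranked):.3f}")
```

Both metrics reward placing likely real bugs near the top of the list, which is the stated goal of the second (reranking) stage.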