An Exploratory Study on Just-in-Time Multi-Programming-Language Bug Prediction

Context: An increasing number of software systems are written in multiple programming languages (PLs); such systems are called multi-programming-language (MPL) systems. MPL bugs (MPLBs) are bugs whose resolution involves multiple PLs. Although resolving MPLBs is highly complex, methods for predicting them are lacking. Objective: This work aims to construct just-in-time (JIT) MPLB prediction models with selected prediction metrics, analyze the significance of those metrics, and evaluate the performance of cross-project JIT MPLB prediction. Method: We develop JIT MPLB prediction models with the selected metrics using machine learning algorithms, and evaluate the models in within-project and cross-project contexts on a dataset we constructed from 18 Apache MPL projects. Results: Random Forest is appropriate for JIT MPLB prediction. The changed LOC of all files, the added LOC of all files, and the current total number of lines across all files of the project are the most crucial metrics in JIT MPLB prediction. The prediction models can be simplified by using only a few top-ranked metrics. For cross-project JIT MPLB prediction, training on a dataset drawn from multiple projects yields significantly higher AUC than training on a dataset from a single project. Conclusions: JIT MPLB prediction models can be constructed with the selected set of metrics, this set can be reduced to build simplified models, and cross-project JIT MPLB prediction is feasible.
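Since the results above are reported in terms of AUC, a minimal sketch of how that measure can be computed may help. It uses the pairwise-ranking definition: the probability that a randomly chosen positive example (here, a commit linked to an MPLB) receives a higher score than a randomly chosen negative one. The function name `auc` and the toy labels and scores are assumptions made for illustration; they are not taken from the study or its dataset.

```python
from itertools import product


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration: 1 = commit linked to an MPLB, 0 = not.
labels = [1, 0, 1, 0, 0]
# Hypothetical predicted probabilities from some classifier.
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

print(round(auc(labels, scores), 4))  # 5 of 6 pairs ranked correctly -> 0.8333
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it is a convenient way to compare models trained on a single project against models trained on several.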

