Context: An increasing number of software systems are written in multiple programming languages (PLs); such systems are called multi-programming-language (MPL) systems. MPL bugs (MPLBs) are bugs whose resolution involves multiple PLs. Despite the high complexity of MPLB resolution, MPLB prediction methods are lacking. Objective: This work aims to construct just-in-time (JIT) MPLB prediction models with selected prediction metrics, analyze the significance of the metrics, and evaluate the performance of cross-project JIT MPLB prediction. Method: We develop JIT MPLB prediction models with the selected metrics using machine learning algorithms and evaluate them in within-project and cross-project contexts on a dataset that we constructed from 18 Apache MPL projects. Results: Random Forest is appropriate for JIT MPLB prediction. The changed LOC of all files, the added LOC of all files, and the current total number of lines across all files of the project are the most crucial metrics in JIT MPLB prediction. The prediction models can be simplified by using only a few top-ranked metrics. For cross-project JIT MPLB prediction, training on data from multiple projects can yield significantly higher AUC than training on data from a single project. Conclusions: JIT MPLB prediction models can be constructed with the selected set of metrics, this set can be reduced to build simplified models, and cross-project JIT MPLB prediction is feasible.
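As a rough illustration of the evaluation setup described in the abstract, the sketch below shows two pieces of scaffolding in plain Python: an AUC computation (using the standard rank-sum formulation) and an enumeration of training/target project splits for single-source versus multi-source cross-project prediction. The function names `auc` and `cross_project_splits` are hypothetical, and pooling all non-target projects for the multi-source case is an assumption; the abstract does not specify the exact splitting procedure, and no model training is shown here.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 ints (1 = commit flagged as MPLB-inducing, by assumption)
    scores: iterable of predicted scores, higher meaning more likely positive
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        # Extend j over the run of scores tied with pairs[i].
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def cross_project_splits(projects, multi_source=True):
    """Yield (training_projects, target_project) pairs.

    multi_source=True : train on all other projects pooled together (assumed setup).
    multi_source=False: train on one other project at a time.
    """
    for target in projects:
        others = [p for p in projects if p != target]
        if multi_source:
            yield others, target
        else:
            for source in others:
                yield [source], target


if __name__ == "__main__":
    # Toy scores only; these are not results from the paper.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    for train, target in cross_project_splits(["A", "B", "C"]):
        print(train, "->", target)
```

In a full pipeline, each split would be used to fit a classifier (Random Forest in the paper) on the training projects' commits and to score the target project's commits, with `auc` applied to those scores.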