Predicting Software Performance with Divide-and-Learn

From arXiv. This paper has been accepted by the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2023.

Predicting the performance of highly configurable software systems is the foundation for performance testing and quality assurance. To that end, recent work has relied on machine and deep learning to model software performance. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: both the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose an approach based on the concept of 'divide-and-learn', dubbed $DaL$. The basic idea is that, to handle sample sparsity, we divide the samples from the configuration landscape into distant divisions, and for each division we build a regularized Deep Neural Network as the local model to deal with feature sparsity. A newly given configuration is then assigned to the model of the division it belongs to for the final prediction. Experimental results from eight real-world systems and five sets of training data reveal that, compared with state-of-the-art approaches, $DaL$ performs no worse than the best counterpart in 33 out of 40 cases (of which 26 are significantly better), with up to $1.94\times$ improvement in accuracy; requires fewer samples to reach the same or better accuracy; and incurs acceptable training overhead. Practically, $DaL$ also considerably improves different global models when they are used as the underlying local models, which further demonstrates its flexibility. To promote open science, all the data, code, and supplementary figures of this work can be accessed at our repository: https://github.com/ideas-labo/DaL.
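The divide-and-learn idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that), and several pieces are assumptions made only for illustration: configurations are tuples of binary options, the division step is a single variance-minimising split on one option, the local model is an L2-regularized linear model trained by gradient descent (a stand-in for the regularized Deep Neural Network), and a new configuration is assigned to a division by reusing the split rule. All names (`DivideAndLearn`, `best_split_option`, `fit_ridge`, `true_perf`) are hypothetical.

```python
from itertools import product
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split_option(X, y):
    """Assumed division rule: the binary option whose on/off split
    gives the lowest total within-division SSE of performance."""
    def cost(j):
        off = [t for x, t in zip(X, y) if x[j] == 0]
        on = [t for x, t in zip(X, y) if x[j] == 1]
        if not off or not on:  # option never varies, so it cannot divide
            return float("inf")
        return sse(off) + sse(on)
    return min(range(len(X[0])), key=cost)

def fit_ridge(X, y, l2=1e-3, lr=0.1, epochs=2000):
    """L2-regularized linear model fit by full-batch gradient descent
    (illustrative stand-in for the regularized DNN local model)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [sum(wj * xj for wj, xj in zip(w, x)) + b - t
                for x, t in zip(X, y)]
        for j in range(d):
            grad = sum(e * x[j] for e, x in zip(errs, X)) / n + l2 * w[j]
            w[j] -= lr * grad
        b -= lr * sum(errs) / n
    return w, b

def predict_linear(model, x):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

class DivideAndLearn:
    """Divide samples, learn one local model per division, then route
    each new configuration to the model of its division."""

    def fit(self, X, y):
        # Assumes at least one option takes both values in the samples.
        self.option = best_split_option(X, y)
        self.models = {}
        for value in (0, 1):
            rows = [(x, t) for x, t in zip(X, y) if x[self.option] == value]
            self.models[value] = fit_ridge([x for x, _ in rows],
                                           [t for _, t in rows])
        return self

    def predict(self, x):
        return predict_linear(self.models[x[self.option]], x)

def true_perf(x):
    """Made-up performance function: option 0 switches between two
    regimes in which different options matter."""
    return 100 + 30 * x[2] if x[0] == 1 else 10 + 2 * x[1]

X = list(product((0, 1), repeat=3))   # all 8 toy configurations
y = [true_perf(x) for x in X]

model = DivideAndLearn().fit(X, y)
global_model = fit_ridge(X, y)        # one model over all samples

query = (1, 0, 1)
print("true:", true_perf(query))
print("divide-and-learn:", round(model.predict(query), 2))
print("single global model:", round(predict_linear(global_model, query), 2))
```

On this toy data the single linear model cannot capture the interaction between option 0 and the other options, whereas the per-division models each face a simpler landscape. The same structure would apply with a neural network as the local model and a learned classifier for the assignment step.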
