Predicting Software Performance with Divide-and-Learn

Source: arXiv. This paper has been accepted at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2023.

Predicting the performance of highly configurable software systems is the foundation for performance testing and quality assurance. To that end, recent work has relied on machine/deep learning to model software performance. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: both the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose an approach based on the concept of 'divide-and-learn', dubbed $DaL$. The basic idea is that, to handle sample sparsity, we divide the samples from the configuration landscape into distant divisions, for each of which we build a regularized Deep Neural Network as the local model to deal with the feature sparsity. A newly given configuration is then assigned to the model of the matching division for the final prediction. Experiment results from eight real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, $DaL$ performs no worse than the best counterpart on 33 out of 40 cases (of which 26 are significantly better) with up to $1.94\times$ improvement in accuracy; requires fewer samples to reach the same or better accuracy; and incurs acceptable training overhead. Practically, $DaL$ also considerably improves different global models when they are used as its underlying local models, which further strengthens its flexibility. To promote open science, all the data, code, and supplementary figures of this work can be accessed at our repository: https://github.com/ideas-labo/DaL.
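
To make the divide, learn, and assign steps described in the abstract concrete, here is a minimal sketch on toy data. It is not the authors' implementation (see the repository for that). Several choices are assumptions made purely for illustration: the samples are divided by a single variance-minimizing split on one option, a ridge-regularized linear model fitted by gradient descent stands in for the regularized Deep Neural Network, and a new configuration is assigned to a division by reusing the split rule. All function names (`best_split`, `fit_ridge`, `fit_divide_and_learn`, `predict_divide_and_learn`) are hypothetical.

```python
from statistics import mean

def best_split(X, y):
    """Return (feature, threshold) minimizing the summed squared error of y.
    Assumes at least one feature takes two or more distinct values."""
    def sse(vals):
        m = mean(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for j in range(len(X[0])):
        # Dropping the largest value keeps both sides of the split non-empty.
        for t in sorted({x[j] for x in X})[:-1]:
            left = [yi for x, yi in zip(X, y) if x[j] <= t]
            right = [yi for x, yi in zip(X, y) if x[j] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    return best[1], best[2]

def fit_ridge(X, y, lam=0.01, lr=0.1, epochs=2000):
    """Stand-in local model (assumption): L2-regularized linear regression
    trained by batch gradient descent, not a deep neural network."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, x)) - yi
            gb += err
            for j in range(d):
                gw[j] += err * x[j]
        b -= lr * gb / n
        w = [wj - lr * (g / n + lam * wj) for wj, g in zip(w, gw)]
    return w, b

def predict_linear(model, x):
    w, b = model
    return b + sum(wj * xj for wj, xj in zip(w, x))

def fit_divide_and_learn(X, y):
    """Divide the samples into two divisions, then fit one local model each."""
    j, t = best_split(X, y)
    models = {}
    for side in (0, 1):
        idx = [i for i, x in enumerate(X) if int(x[j] > t) == side]
        models[side] = fit_ridge([X[i] for i in idx], [y[i] for i in idx])
    return j, t, models

def predict_divide_and_learn(fitted, x):
    """Assign the configuration to a division, then query its local model."""
    j, t, models = fitted
    return predict_linear(models[int(x[j] > t)], x)

# Toy landscape (made up): option 0 switches between two performance regimes.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [100 + 10 * b if a == 0 else 10 + c for a, b, c in X]
fitted = fit_divide_and_learn(X, y)
```

In this toy landscape the split lands on option 0, so each local model only has to capture the one option that matters within its own regime. That mirrors the motivation stated above: samples are concentrated in distant regions, and only a few options are influential inside each region.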
