Predicting Software Performance with Divide-and-Learn

From arXiv. This paper has been accepted by the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2023.

Predicting the performance of highly configurable software systems is the foundation for performance testing and quality assurance. To that end, recent work has been relying on machine/deep learning to model software performance. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose an approach based on the concept of 'divide-and-learn', dubbed DaL. The basic idea is that, to handle sample sparsity, we divide the samples from the configuration landscape into distant divisions, for each of which we build a regularized Deep Neural Network as the local model to deal with the feature sparsity. A newly given configuration would then be assigned to the right model of division for the final prediction. Experiment results from eight real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 33 out of 40 cases (within which 26 cases are significantly better) with up to 1.94x improvement in accuracy; requires fewer samples to reach the same/better accuracy; and incurs acceptable training overhead. Practically, DaL also considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility. To promote open science, all the data, code, and supplementary figures of this work can be accessed at our repository: https://github.com/ideas-labo/DaL.
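To make the three steps in the abstract concrete (divide the samples, train one local model per division, route a new configuration to a division), here is a minimal pure-Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the single-threshold split that minimizes squared error, the L2-regularized linear model standing in for the regularized Deep Neural Network, the reuse of the split rule to assign new configurations, the toy data, and all function names. The abstract does not say how the divisions are formed or how a configuration is assigned to one.

```python
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split(X, y):
    """Divide step (assumed rule): choose the (feature, threshold) pair
    whose two sides have the lowest total SSE in performance."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:
        raise ValueError("no feature has two distinct values")
    return best[1], best[2]

def fit_ridge(X, y, l2=1e-3, lr=0.05, epochs=5000):
    """Local model (stand-in for a regularized DNN): linear regression
    with an L2 penalty on the weights, fitted by gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [sum(wj * xj for wj, xj in zip(w, row)) + b - yi
                for row, yi in zip(X, y)]
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errs, X)) / n + l2 * w[j]
            w[j] -= lr * grad
        b -= lr * sum(errs) / n
    return w, b

def predict_linear(model, row):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, row)) + b

def fit_divide_and_learn(X, y):
    """Split the samples once, then train one local model per division."""
    j, t = best_split(X, y)
    models = {}
    for side in (False, True):
        idx = [i for i, row in enumerate(X) if (row[j] > t) == side]
        models[side] = fit_ridge([X[i] for i in idx], [y[i] for i in idx])
    return j, t, models

def predict(dal, row):
    """Route a new configuration to its division, then use that local model."""
    j, t, models = dal
    return predict_linear(models[row[j] > t], row)

# Hypothetical toy system: option 0 is a binary switch that changes the
# performance regime; option 1 is a numeric option scaled to [0, 1].
X = [[0, 0.0], [0, 0.5], [0, 1.0], [1, 0.0], [1, 0.5], [1, 1.0]]
y = [10.0, 11.0, 12.0, 1.0, 1.25, 1.5]

dal = fit_divide_and_learn(X, y)
print("split on feature", dal[0], "at threshold", dal[1])
print("prediction for [0, 0.25]:", predict(dal, [0, 0.25]))
print("prediction for [1, 0.25]:", predict(dal, [1, 0.25]))
```

In this toy data the two regimes have different slopes, so a single global linear model has to compromise between them, while each local model only needs to fit its own division. That is the intuition behind handling sample sparsity by dividing first; the per-model regularization is what the abstract ties to feature sparsity.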
