Machine/deep learning models have been widely adopted for predicting the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherent in the configuration landscape: the influence of configuration options (features) and the distribution of data samples are both highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL, based on the new paradigm of dividable learning, which builds a model via "divide-and-learn". To handle sample sparsity, the samples from the configuration landscape are divided into distant divisions, for each of which we build a sparse local model, e.g., a regularized Hierarchical Interaction Neural Network, to deal with the feature sparsity. A newly given configuration is then assigned to the model of the right division for the final prediction. Further, DaL adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experimental results from 12 real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart in 44 out of 60 cases, with up to 1.61x improvement in accuracy; requires fewer samples to reach the same or better accuracy; and incurs acceptable training overhead. In particular, the mechanism that adapts the parameter d reaches the optimal value in 76.43% of the individual runs. The results also confirm that the paradigm of dividable learning is more suitable than other similar paradigms, such as ensemble learning, for predicting configuration performance. Practically, DaL considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility.
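The core "divide-and-learn" idea above can be illustrated with a minimal sketch: divide the training configurations into divisions, fit one local model per division, and route a new configuration to the model of its division. Everything below is an illustrative stand-in, not DaL itself: the division step uses a naive k-means (DaL uses its own division scheme and adapts the number of divisions d, whereas here k is fixed), and each "local model" is simply the division's mean performance rather than the regularized Hierarchical Interaction Neural Network described in the abstract.

```python
# Hypothetical sketch of divide-and-learn; NOT the authors' DaL implementation.
import random
from statistics import mean

def dist2(a, b):
    """Squared Euclidean distance between two configurations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means clustering; returns the final centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        centroids = [tuple(mean(col) for col in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

def fit(configs, perfs, k=2):
    """Divide the samples into k divisions and fit one local model per division."""
    centroids = kmeans(configs, k)
    buckets = [[] for _ in range(k)]
    for c, p in zip(configs, perfs):
        buckets[min(range(k), key=lambda i: dist2(c, centroids[i]))].append(p)
    # Local model = division mean; an empty division falls back to the global mean.
    models = [mean(b) if b else mean(perfs) for b in buckets]
    return centroids, models

def predict(centroids, models, config):
    """Route a newly given configuration to its division's local model."""
    return models[min(range(len(centroids)), key=lambda i: dist2(config, centroids[i]))]
```

For two well-separated groups of configurations with distinct performance values, `fit` recovers the two divisions and `predict` routes an unseen configuration to the corresponding local model.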