Yes. Software engineering (SE) data can have "smoother" boundaries between classes than traditional AI datasets. More precisely, the magnitude of the second derivative of the loss function is typically much smaller for SE data. A new hyper-parameter optimizer, called SMOOTHIE, can exploit this idiosyncrasy of SE data. We compare SMOOTHIE with a state-of-the-art AI hyper-parameter optimizer on three tasks: (a) GitHub issue lifetime prediction; (b) detecting false alarms among static code warnings; and (c) defect prediction. For completeness, we also report experiments on some standard AI datasets. SMOOTHIE runs faster and predicts better on the SE data, but ties with the AI tool on non-SE data. Hence we conclude that SE data can be different from other kinds of data, and that those differences mean we should use different kinds of algorithms for our data. To support open science and other researchers working in this area, all our scripts and datasets are available online at https://github.com/yrahul3910/smoothness-hpo/.
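To make the phrase "magnitude of the second derivative of the loss function" concrete, the following is a minimal sketch of one way such a quantity could be estimated numerically. It is not the SMOOTHIE algorithm and is not taken from the paper's repository; the function names (`directional_curvature`, `max_abs_curvature`, `toy_loss`), the finite-difference step `eps`, and the random-direction sampling are all assumptions made purely for illustration.

```python
import math
import random


def directional_curvature(loss, w, d, eps=1e-3):
    """Central finite-difference estimate of the second derivative of
    `loss` at point `w` along direction `d` (normalized to unit length).
    `loss` is any callable mapping a list of floats to a float."""
    norm = math.sqrt(sum(di * di for di in d))
    u = [di / norm for di in d]
    w_plus = [wi + eps * ui for wi, ui in zip(w, u)]
    w_minus = [wi - eps * ui for wi, ui in zip(w, u)]
    return (loss(w_plus) - 2.0 * loss(w) + loss(w_minus)) / (eps * eps)


def max_abs_curvature(loss, w, n_dirs=50, seed=0):
    """Largest absolute directional curvature over `n_dirs` random
    Gaussian directions: a rough, assumed proxy for how 'sharp' the
    loss is around `w` (smaller values suggest a smoother landscape)."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_dirs):
        d = [rng.gauss(0.0, 1.0) for _ in w]
        best = max(best, abs(directional_curvature(loss, w, d)))
    return best


def toy_loss(w):
    """Hypothetical quadratic loss 1.5 * ||w||^2; its second derivative
    along any unit direction is exactly 3."""
    return 1.5 * sum(wi * wi for wi in w)
```

Under these assumptions, calling `max_abs_curvature(toy_loss, [0.5, -1.0, 2.0])` should return a value very close to 3, and a dataset whose loss yields consistently smaller values would count as "smoother" in the sense the abstract describes.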