In many domains, examples are plentiful but labels for those examples are scarce; e.g., we may have access to millions of lines of source code but only a handful of warnings about that code. In those domains, semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data. Standard SSL algorithms use ``weak'' knowledge (i.e., knowledge not based on specific SE knowledge); for example, they co-train two learners and use the good labels from one to train the other. An alternative approach to SSL in software analytics is to use ``strong'' knowledge drawn from SE. For example, an often-used heuristic in SE is that unusually large artifacts contain undesired properties (e.g., more bugs). This paper argues that such ``strong'' algorithms perform better than standard, weaker SSL algorithms. We show this by learning models from labels generated using either weak SSL or our ``stronger'' FRUGAL algorithm. In four domains (distinguishing security-related bug reports; mitigating bias in decision-making; predicting issue close time; and reducing false alarms in static code warnings), FRUGAL required only 2.5% of the data to be labeled yet outperformed standard semi-supervised learners that relied on (e.g.) domain-independent graph theory concepts. Hence, for future work, we strongly recommend the use of strong heuristics for semi-supervised learning in SE applications. To better support other researchers, our scripts and data are online at https://github.com/HuyTu7/FRUGAL.
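As an illustration of the kind of ``strong'' heuristic mentioned above (unusually large artifacts are more likely to have undesired properties), the following minimal sketch assigns provisional labels from artifact size alone. It is not the FRUGAL algorithm; the function name `label_by_size`, the 75th-percentile cutoff, and the sample sizes are assumptions made purely for illustration.

```python
from statistics import quantiles


def label_by_size(sizes, percentile=75):
    """Return 1 for artifacts larger than the given size percentile, else 0.

    Hypothetical helper: the percentile cutoff is an assumed parameter,
    not a value taken from the paper.
    """
    if len(sizes) < 2:
        raise ValueError("need at least two artifacts to compute a percentile")
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(sizes, n=100, method="inclusive")
    threshold = cuts[percentile - 1]
    return [1 if size > threshold else 0 for size in sizes]


# Made-up artifact sizes (e.g. lines of code); only the outlier is flagged.
sizes = [10, 12, 15, 20, 400]
labels = label_by_size(sizes)
print(labels)  # [0, 0, 0, 0, 1]
```

Labels produced this way need no human annotation, so they could serve as a starting point that a small number of true labels then corrects.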