This paper proposes a new challenge problem for software analytics. In the process we shall call "software review", a panel of SMEs (subject matter experts) reviews examples of software behavior to recommend how to improve that software's operation. SME time is usually extremely limited so, ideally, this panel can complete this optimization task after looking at just a small number of very informative examples. To support this review process, we explore methods that train a predictive model to guess whether some oracle will like or dislike the next example. Such a predictive model can work with the SMEs to guide them in their exploration of all the examples. Also, after the panelists leave, that model can be used as an oracle in place of the panel (to handle new examples while the panelists are busy elsewhere). In 31 case studies (ranging from high-level decisions about software processes to low-level decisions about how to configure video encoding software), we show that such predictive models can be built using as few as 12 to 30 labels. To the best of our knowledge, this paper's success with only a handful of examples (and no large language model) is unprecedented. In accordance with the principles of open science, we offer all our code and data at https://github.com/timm/ez/tree/Stable-EMSE-paper so that others can repeat/refute/improve these results.
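As a rough illustration of the review loop described above, the following Python sketch labels examples under a fixed budget and returns a like/dislike predictor that could stand in for the panel afterwards. It is not the implementation from the repository. The names `review` and `oracle`, the nearest-centroid model, the square-root split into "best" and "rest", and the warm start of 4 labels are all assumptions made for illustration.

```python
import random


def centroid(rows):
    """Column-wise mean of a non-empty list of numeric rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def review(pool, oracle, budget=30, warm_start=4, seed=1):
    """Label at most `budget` rows from `pool`; return (labeled, predict).

    `oracle(row)` is a hypothetical stand-in for the SME panel. It is
    assumed to return a numeric score where lower means "liked more".
    """
    rng = random.Random(seed)
    todo = pool[:]                      # copy, so the caller's pool is untouched
    rng.shuffle(todo)
    labeled = [(row, oracle(row)) for row in todo[:warm_start]]
    todo = todo[warm_start:]

    def split():
        # Assumed heuristic: the sqrt(n) best-scoring labeled rows are
        # "best"; every other labeled row is "rest".
        ranked = sorted(labeled, key=lambda pair: pair[1])
        n = max(1, int(len(ranked) ** 0.5))
        best = [row for row, _ in ranked[:n]]
        rest = [row for row, _ in ranked[n:]]
        return centroid(best), centroid(rest)

    while len(labeled) < budget and todo:
        best, rest = split()
        # Ask the oracle about the unlabeled row that looks most like
        # "best" and least like "rest" under the current model.
        pick = min(todo, key=lambda row: dist(row, best) - dist(row, rest))
        todo.remove(pick)
        labeled.append((pick, oracle(pick)))

    best, rest = split()

    def predict(row):
        """True if the model guesses the oracle would like this row."""
        return dist(row, best) < dist(row, rest)

    return labeled, predict
```

Here the oracle is consulted exactly once per labeled row, so `budget` directly caps how much panel time is spent. A real study would replace the centroid model and the acquisition rule with whatever method is under evaluation.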