Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting developer time. While machine learning models have been used to predict flakiness and its root causes, there is less work on providing support to fix the problem. To address this gap, we propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Though it is unrealistic at this stage to accurately predict the fix itself, the categories provide precise guidance about what part of the test code to look at. Our approach is based on language models, namely CodeBERT and UniXcoder, whose output is fine-tuned with a Feed Forward Neural Network (FNN) or Siamese Network-based Few-Shot Learning (FSL). Our experimental results show that UniXcoder outperforms CodeBERT in correctly predicting most of the categories of fixes a developer should apply. Furthermore, FSL does not appear to have any significant effect. Given the high accuracy obtained for most fix categories, our proposed framework has the potential to help developers fix flaky tests quickly and accurately. To aid future research, we make our automated labeling tool, dataset, prediction models, and experimental infrastructure publicly available.
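The idea of predicting a fix category from test code alone, by comparing a test against a few labeled examples, can be illustrated with a minimal sketch. This is not the framework described in the abstract: `embed` is a hypothetical stand-in that counts identifiers instead of calling CodeBERT or UniXcoder, nothing is trained, and the category names `change_data_structure` and `adjust_wait` are illustrative placeholders rather than the 13 categories of the paper. The sketch only shows the similarity-based comparison that a Siamese-style few-shot setup relies on.

```python
import math
import re
from collections import Counter


def embed(test_code: str) -> Counter:
    """Stand-in embedding (assumption): counts of identifiers in the test code.

    A real pipeline would obtain a dense vector from a code language model here.
    """
    return Counter(re.findall(r"[A-Za-z_]\w*", test_code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def category_scores(query: str, support: dict[str, list[str]]) -> dict[str, float]:
    """Mean similarity between the query test and each category's labeled examples."""
    q = embed(query)
    return {
        category: sum(cosine(q, embed(example)) for example in examples) / len(examples)
        for category, examples in support.items()
        if examples
    }


def predict_fix_category(query: str, support: dict[str, list[str]]) -> str:
    """Return the category whose labeled examples are most similar to the query."""
    scores = category_scores(query, support)
    return max(scores, key=scores.get)


# Hypothetical support set: a few labeled flaky-test snippets per category.
SUPPORT = {
    "change_data_structure": [
        "Map<String,Integer> m = new HashMap<>(); assertEquals(expected, m.toString());",
        "Set<String> s = new HashSet<>(); assertEquals(list, new ArrayList<>(s));",
    ],
    "adjust_wait": [
        "Thread.sleep(100); assertTrue(server.isReady());",
        "Thread.sleep(500); assertTrue(worker.isDone());",
    ],
}

QUERY = "Thread.sleep(200); assertTrue(client.isConnected());"
print(predict_fix_category(QUERY, SUPPORT))
```

Swapping `embed` for a model-based encoder and learning the similarity function would bring the sketch closer to a few-shot classifier; the control flow around it would stay the same.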