Flaky tests, which pass or fail inconsistently without any code change, are a major challenge in software engineering in general and in quantum software engineering in particular, where the complexity and probabilistic nature of quantum programs make flakiness especially hard to diagnose; such tests hide real issues and waste developer effort. We aim to create an automated framework for detecting flaky tests in quantum software and an extended dataset of quantum flaky tests, overcoming the limitations of manual methods. Building on prior manual analysis of 14 quantum software repositories, we expanded the dataset and automated flaky test detection using transformer embeddings and cosine similarity. We conducted experiments with Large Language Models (LLMs) from the OpenAI GPT and Meta LLaMA families to assess their ability to detect and classify flaky tests from code and issue descriptions. Embedding transformers proved effective: we identified 25 new flaky tests, expanding the dataset by 54%. The top-performing LLMs achieved an F1-score of 0.8871 for flakiness detection but only 0.5839 for root cause identification. We introduce an automated flaky test detection framework based on machine learning that shows promising results, while highlighting the need for improved root cause detection and classification in large quantum codebases. Future work will focus on improving detection techniques and developing automatic fixes for flaky tests.
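To make the similarity-based detection step concrete, the sketch below illustrates the general idea of matching a candidate test against known flaky tests by cosine similarity over embedding vectors. All names, vectors, and the threshold value here are hypothetical, and the embeddings are assumed to be precomputed by some transformer encoder; this is an illustrative sketch, not the paper's actual pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical precomputed embeddings of tests already labeled as flaky.
known_flaky = {
    "test_backend_timeout": [0.9, 0.1, 0.3],
    "test_unseeded_sampler": [0.2, 0.8, 0.5],
}

# Hypothetical embedding of a candidate test to screen.
candidate = [0.88, 0.15, 0.28]

THRESHOLD = 0.95  # assumed similarity cutoff, tuned on labeled data

# Flag the candidate if it is close to any known flaky test.
scores = {name: cosine_similarity(candidate, emb)
          for name, emb in known_flaky.items()}
flagged = [name for name, score in scores.items() if score >= THRESHOLD]
print(flagged)
```

In practice the threshold trades precision for recall: a high cutoff flags only near-duplicates of known flaky tests, while a lower one surfaces more candidates for manual review.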