Do Automatic Test Generation Tools Generate Flaky Tests?

Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has considered only developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6 356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) or Pynguin (Python) and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, the flakiness suppression mechanisms already implemented in EvoSuite are effective in alleviating this issue (71.7 % fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently: their non-deterministic behavior is more frequently caused by randomness than by networking or concurrency. With flakiness suppression enabled, the remaining flaky tests differ significantly from any flakiness previously reported; most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, together with the accompanying dataset, can help maintainers improve test generation tools, support recommendations for developers using these tools, and serve as a foundation for future research on test flakiness and test generation.
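The rerun-based detection described in the abstract (execute each test 200 times and look for inconsistent outcomes) can be illustrated with a minimal sketch. This is not the authors' tooling: the names `classify_by_rerun`, `make_random_test`, and `stable_test` are hypothetical, and the sketch assumes a test is a plain callable that signals failure by raising an exception.

```python
import random
from typing import Callable, Dict


def classify_by_rerun(test: Callable[[], None], runs: int = 200) -> Dict[str, object]:
    """Run `test` repeatedly and report whether its outcomes are consistent.

    Hypothetical helper: a test counts as flaky here when it both passes
    and fails at least once across the reruns.
    """
    passed = 0
    failed = 0
    for _ in range(runs):
        try:
            test()
            passed += 1
        except Exception:
            # Any exception, including AssertionError, counts as a failure.
            failed += 1
    return {"passed": passed, "failed": failed, "flaky": passed > 0 and failed > 0}


def make_random_test(seed: int) -> Callable[[], None]:
    """Build a hypothetical test whose outcome depends on random values."""
    rng = random.Random(seed)  # seeded so that this sketch itself is reproducible

    def test() -> None:
        # Passes or fails depending on the next random draw.
        assert rng.random() < 0.5

    return test


def stable_test() -> None:
    """A deterministic test that always passes."""
    assert sorted([3, 1, 2]) == [1, 2, 3]


if __name__ == "__main__":
    print(classify_by_rerun(stable_test))
    print(classify_by_rerun(make_random_test(seed=0)))
```

The second test stands in for the randomness-induced flakiness that the abstract identifies as the most frequent cause in generated tests. A real study would also have to isolate each run (fresh process, controlled test order), which this sketch leaves out.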

