On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems

Source: arXiv. 12 pages, 5 tables, 3 figures. Accepted at the 48th International Conference on Software Engineering: Software Engineering in Practice (ICSE SEIP 2026).

Flaky tests are a common problem in software testing. They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. However, the prevalence and underlying causes of this flakiness remain unclear. We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite. We amplified the existing test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, and assessed the flakiness of the generated test cases. Our results suggest that the proportion of flaky tests is slightly higher among generated tests than among existing tests. Based on a manual inspection, we found that the most common root cause of flakiness was a test's reliance on an order that is not guaranteed ("unordered collection"), which was present in 72 of 115 flaky tests (63%). Furthermore, both LLMs transferred flakiness from the existing tests to the newly generated tests via the provided prompt context. Our experiments suggest that flakiness transfer is more prevalent in closed-source systems such as SAP HANA than in open-source systems. Our study informs developers about the types of flakiness to expect from LLM-generated tests. It also highlights the importance of providing LLMs with tailored context when they are used for test generation.
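To illustrate the "unordered collection" root cause, the following is a minimal sketch and is not taken from the study's test suites. It assumes Python's standard `sqlite3` module and a hypothetical table `t`. The function names are invented for this example. The fragile check compares the result of a query without `ORDER BY` against a fixed list, so it depends on a row order that SQL does not guarantee. The two robust variants either request an explicit order or compare the rows as a multiset.

```python
import sqlite3
from collections import Counter


def make_db():
    """Create an in-memory database with a small hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(2, "b"), (1, "a"), (3, "c")],
    )
    return conn


def fragile_check(conn):
    """Order-dependent assertion: no ORDER BY, yet a fixed row order is expected.

    The result may match on one engine, version, or query plan and differ on
    another, which is what makes a test written this way potentially flaky.
    """
    rows = conn.execute("SELECT name FROM t").fetchall()
    return rows == [("b",), ("a",), ("c",)]


def robust_check_ordered(conn):
    """Ask the database for a deterministic order before comparing."""
    rows = conn.execute("SELECT name FROM t ORDER BY name").fetchall()
    return rows == [("a",), ("b",), ("c",)]


def robust_check_multiset(conn):
    """Compare rows as a multiset so the returned order is irrelevant."""
    rows = conn.execute("SELECT name FROM t").fetchall()
    return Counter(rows) == Counter([("a",), ("b",), ("c",)])


if __name__ == "__main__":
    conn = make_db()
    print("ordered check passes:", robust_check_ordered(conn))
    print("multiset check passes:", robust_check_multiset(conn))
```

Which robust variant fits depends on what the test is meant to verify: an explicit `ORDER BY` is appropriate when order is part of the expected behavior, while the multiset comparison suits tests that only care about which rows are returned.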
