Automated Discovery of Test Oracles for Database Management Systems Using LLMs

Since 2020, automated testing of Database Management Systems (DBMSs) has flourished, uncovering hundreds of bugs in widely used systems. A cornerstone of these techniques is the test oracle, which typically implements a mechanism for generating equivalent query pairs and identifies bugs by checking that their results are consistent. However, while applying these oracles can be automated, designing them remains a fundamentally manual endeavor. This paper explores the use of large language models (LLMs) to automate the discovery and instantiation of test oracles, addressing a long-standing bottleneck on the path to fully automated DBMS testing. Although LLMs demonstrate impressive creativity, they are prone to hallucinations that can produce numerous false-positive bug reports. Furthermore, their significant monetary cost and latency mean that LLM invocations should be limited so that bug detection stays efficient and economical. To this end, we introduce Argus, a novel framework built on the core concept of the Constrained Abstract Query: a SQL skeleton containing placeholders together with their instantiation conditions (e.g., requiring a placeholder to be filled by a boolean column). Argus uses LLMs to generate pairs of these skeletons that are asserted to be semantically equivalent. This equivalence is then formally proven with a SQL equivalence solver to ensure soundness. Finally, the placeholders in the verified skeletons are instantiated with concrete, reusable SQL snippets, also synthesized by LLMs, to efficiently produce complex test cases. We implemented Argus and evaluated it on five extensively tested DBMSs, discovering 40 previously unknown bugs, 35 of which are logic bugs; 36 have been confirmed and 26 already fixed by the developers.
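
To make the idea of a Constrained Abstract Query more concrete, the following is a minimal sketch and not the Argus implementation. It assumes a pair of skeletons has already been proven equivalent (the solver step is omitted), uses SQLite as a stand-in DBMS, and fills one placeholder from a small pool of reusable snippets. The names `AbstractQuery`, `Snippet`, `instantiate`, `results_agree`, and `run_demo`, the table `t`, and the chosen skeleton pair are all hypothetical and were picked only for illustration.

```python
import sqlite3
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """A reusable SQL fragment tagged with the kind of value it produces."""
    sql: str
    kind: str


@dataclass(frozen=True)
class AbstractQuery:
    """A SQL skeleton plus the condition each placeholder must satisfy."""
    template: str      # uses {NAME} placeholders
    constraints: dict  # placeholder name -> required snippet kind


def instantiate(query, fills):
    """Fill the placeholders, rejecting snippets that violate a constraint."""
    for name, required in query.constraints.items():
        if fills[name].kind != required:
            raise ValueError(f"placeholder {name} needs a {required} snippet")
    return query.template.format(**{name: s.sql for name, s in fills.items()})


# Hypothetical skeleton pair, assumed to be verified as equivalent beforehand:
# a double negation keeps TRUE as TRUE and NULL as NULL, so the filter is unchanged.
LEFT = AbstractQuery("SELECT a, b FROM t WHERE ({P})", {"P": "boolean"})
RIGHT = AbstractQuery("SELECT a, b FROM t WHERE NOT (NOT ({P}))", {"P": "boolean"})

# Hypothetical snippet pool; in the paper's setting these would be LLM-synthesized.
SNIPPETS = [
    Snippet("a > b", "boolean"),
    Snippet("a IS NULL OR b BETWEEN 1 AND 3", "boolean"),
    Snippet("a IN (SELECT b FROM t)", "boolean"),
]


def results_agree(conn, left_sql, right_sql):
    """Compare the two result sets as multisets (row order is ignored)."""
    left = Counter(conn.execute(left_sql).fetchall())
    right = Counter(conn.execute(right_sql).fetchall())
    return left == right


def run_demo():
    """Instantiate the pair with every snippet and report whether results agree."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(1, 0), (2, None), (None, 3), (4, 4)],
    )
    outcomes = []
    for snippet in SNIPPETS:
        fills = {"P": snippet}
        outcomes.append(
            results_agree(conn, instantiate(LEFT, fills), instantiate(RIGHT, fills))
        )
    conn.close()
    return outcomes


if __name__ == "__main__":
    # A False entry would flag a candidate logic bug in the DBMS under test.
    print(run_demo())
```

In this sketch, the one skeleton pair is reused with every snippet without further model calls, which mirrors the abstract's point about limiting LLM invocations; a disagreement between the two result sets would be reported as a potential logic bug.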
