Datalog is a popular and widely-used declarative logic programming language. Datalog engines apply many cross-rule optimizations; bugs in them can cause incorrect results. To detect such optimization bugs, we propose an automated testing approach called Incremental Rule Evaluation (IRE), which synergistically tackles the test oracle and test case generation problem. The core idea behind the test oracle is to compare the results of an optimized program and a program without cross-rule optimization; any difference indicates a bug in the Datalog engine. Our core insight is that, for an optimized, incrementally-generated Datalog program, we can evaluate all rules individually by constructing a reference program to disable the optimizations that are performed among multiple rules. Incrementally generating test cases not only allows us to apply the test oracle for every new rule generated-we also can ensure that every newly added rule generates a non-empty result with a given probability and eschew recomputing already-known facts. We implemented IRE as a tool named Deopt, and evaluated Deopt on four mature Datalog engines, namely Souffl\'e, CozoDB, $\mu$Z, and DDlog, and discovered a total of 30 bugs. Of these, 13 were logic bugs, while the remaining were crash and error bugs. Deopt can detect all bugs found by queryFuzz, a state-of-the-art approach. Out of the bugs identified by Deopt, queryFuzz might be unable to detect 5. Our incremental test case generation approach is efficient; for example, for test cases containing 60 rules, our incremental approach can produce 1.17$\times$ (for DDlog) to 31.02$\times$ (for Souffl\'e) as many valid test cases with non-empty results as the naive random method. We believe that the simplicity and the generality of the approach will lead to its wide adoption in practice.
翻译:Datalog是一种流行且广泛使用的声明式逻辑编程语言。Datalog引擎应用了许多跨规则优化;其中的缺陷可能导致错误的结果。为了检测此类优化缺陷,我们提出了一种名为增量规则评估(IRE)的自动化测试方法,该方法协同解决了测试预言和测试用例生成问题。测试预言的核心思想是比较经过优化的程序与未使用跨规则优化的程序的结果;任何差异都表明Datalog引擎存在缺陷。我们的核心见解是,对于一个经过优化的、增量生成的Datalog程序,我们可以通过构建一个参考程序来单独评估所有规则,从而禁用跨多条规则执行的优化。增量生成测试用例不仅使我们能够为每个新生成的规则应用测试预言——我们还能确保每个新添加的规则以给定概率生成非空结果,并避免重新计算已知事实。我们将IRE实现为一个名为Deopt的工具,并在四个成熟的Datalog引擎——Soufflé、CozoDB、μZ和DDlog上评估了Deopt,共发现了30个缺陷。其中,13个是逻辑缺陷,其余为崩溃和错误缺陷。Deopt能够检测出最先进方法queryFuzz发现的所有缺陷。在Deopt识别出的缺陷中,queryFuzz可能无法检测出其中5个。我们的增量测试用例生成方法是高效的;例如,对于包含60条规则的测试用例,与朴素随机方法相比,我们的增量方法可以生成1.17倍(对于DDlog)到31.02倍(对于Soufflé)的有效非空结果测试用例。我们相信,该方法的简单性和通用性将使其在实践中得到广泛应用。