One Size Does NOT Fit All: On the Importance of Physical Representations for Datalog Evaluation

Datalog is an increasingly popular recursive query language that is declarative by design, meaning that an engine must translate its programs into an actual physical execution plan. When generating this plan, a central decision is how to physically represent all involved relations, an aspect in which existing Datalog engines are surprisingly restrictive and often resort to one-size-fits-all solutions. The reason for this is that the typical execution plan of a Datalog program performs not just a single type of operation against the physical representations, but a mixture of operations, such as insertions, lookups, and containment checks. Further, the relevance of each operation type depends heavily on the workload characteristics, which range from familiar properties such as the size, multiplicity, and arity of the individual relations to Datalog-specific properties, such as the "interweaving" of rules when relations occur multiple times and, in particular, the recursiveness of the query, which might generate new tuples on the fly during evaluation. This indicates that a variety of physical representations, each with its own strengths and weaknesses, is required to meet the specific needs of different workload situations. To evaluate this, we conduct an in-depth experimental study of the interplay between potentially suitable physical representations and seven dimensions of workload characteristics that vary across actual Datalog programs, revealing which properties actually matter. Based on these insights, we design an automatic selection mechanism that utilizes a set of decision trees to identify suitable physical representations for a given workload.
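To make the mixture of operations concrete, the following is a minimal illustrative sketch, not taken from any engine discussed here, of semi-naive evaluation for a transitive-closure program. The function name `transitive_closure`, the choice of a hash index for `edge` and a hash set for `path`, and the `counts` bookkeeping are all assumptions made for illustration. The sketch only shows that a single recursive rule already issues lookups, containment checks, and insertions against its relations.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive evaluation of the (hypothetical) example program:
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    Returns the derived path relation and a count of the operations issued.
    """
    # Assumed representation of edge: hash index on the first column (serves lookups).
    edge_index = defaultdict(list)
    for src, dst in edges:
        edge_index[src].append(dst)

    # Assumed representation of path: hash set (serves containment checks and insertions).
    path = set(edges)
    delta = set(edges)  # tuples that were new in the previous iteration
    counts = {"lookup": 0, "contains": 0, "insert": 0}

    while delta:
        new_delta = set()
        for x, y in delta:
            counts["lookup"] += 1  # index lookup: all z with edge(y, z)
            for z in edge_index.get(y, ()):
                counts["contains"] += 1  # containment check: is path(x, z) known?
                if (x, z) not in path:
                    counts["insert"] += 1  # insertion of a newly derived tuple
                    path.add((x, z))
                    new_delta.add((x, z))
        delta = new_delta  # recursion: only new tuples feed the next round

    return path, counts
```

Calling `transitive_closure([(1, 2), (2, 3), (3, 4)])` derives the three additional tuples `(1, 3)`, `(2, 4)`, and `(1, 4)`. On cyclic inputs the share of containment checks that find an existing tuple grows, which is one way in which the workload shifts the relative weight of each operation type.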
