Query Optimization and Evaluation via Information Theory: A Tutorial

Database theory is exciting because it studies highly general and practically useful abstractions. Conjunctive query (CQ) evaluation is a prime example: it simultaneously generalizes graph pattern matching, constraint satisfaction, and statistical inference, among others. This generality is both the strength and the central challenge of the field. The query optimization and evaluation problem is fundamentally a "meta-algorithm" problem: given a query $Q$ and statistics $\mathcal{S}$ about the input database, how should one best answer $Q$? Because the problem is so general, it is often impossible for such a meta-algorithm to match the runtimes of specialized algorithms designed for a fixed query -- or so it seemed. The past fifteen years have witnessed an exciting development in database theory: PANDA, a general framework for evaluating conjunctive queries given input data statistics, which emerged from advances in database theory, constraint satisfaction problems (CSP), and graph algorithms. The key idea is to derive information-theoretically tight upper bounds on the cardinalities of intermediate relations produced during query evaluation. These bounds determine the costs of query plans, and crucially, the query plans themselves are derived directly from the mathematical proof of the upper bound. This tight coupling of proof and algorithm is what makes PANDA both principled and powerful. Remarkably, this generic algorithm matches -- and in some cases subsumes -- the runtimes of specialized algorithms for the same problems, including algorithms that exploit fast matrix multiplication. This paper is a tutorial on the PANDA framework. We illustrate the key ideas through concrete examples, conveying the main intuitions behind the theory.
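
As a small illustration of what an information-theoretic cardinality bound looks like, consider the standard triangle query. This example is not taken from the abstract; the relation names $R, S, T$, the size parameter $N$, and the entropy function $h$ are introduced here only for illustration. Let $Q(A,B,C) \leftarrow R(A,B) \wedge S(B,C) \wedge T(A,C)$ with $|R|, |S|, |T| \le N$, and let $h$ be the entropy of the uniform distribution over the output tuples of $Q$, so that $h(ABC) = \log |Q|$. Since each two-variable marginal is supported on one input relation, $h(AB)$, $h(BC)$, and $h(AC)$ are each at most $\log N$. Two applications of submodularity then give the bound:

$$
\begin{aligned}
h(AB) + h(BC) &\ge h(ABC) + h(B), \\
h(B) + h(AC) &\ge h(ABC) + h(\emptyset) = h(ABC), \\
\text{hence}\quad 2\,h(ABC) &\le h(AB) + h(BC) + h(AC) \le 3 \log N, \\
\text{so}\quad |Q| &\le N^{3/2}.
\end{aligned}
$$

This is only a sketch of the kind of proof the abstract refers to. How the steps of such a proof are turned into a query plan is the subject of the tutorial itself and is not reproduced here.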
