The Importance of Parameters in Database Queries

We propose and study a framework for quantifying the importance of the choices of parameter values to the result of a query over a database. These parameters occur as constants in logical queries, such as conjunctive queries. In our framework, the importance of a parameter is its SHAP score, a popular instantiation of the game-theoretic Shapley value that is used to measure the importance of feature values in machine learning models. We make the case for using this score by explaining the intuition behind SHAP, and by showing that we arrive at it via two different, apparently opposing, approaches to quantifying the contribution of a parameter. Applying SHAP requires two components in addition to the query and the database: (a) a probability distribution over the combinations of parameter values, and (b) a utility function that measures the similarity between the result for the original parameters and the result for hypothetical parameters. The main question addressed in the paper is the complexity of calculating the SHAP score for different distributions and similarity measures. In particular, we devise polynomial-time algorithms for the case of full acyclic conjunctive queries for certain (natural) similarity functions. We extend our results to conjunctive queries with parameterized filters (e.g., inequalities between variables and parameters). We also illustrate the application of our results to "why-not" explanations (aiming to explain the absence of a query answer), where we consider the task of quantifying the contribution of query components to the elimination of the non-answer under consideration. Finally, we discuss a simple approximation technique for the case of correlated parameters.
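
To make the two required components concrete, the following is a minimal brute-force sketch of a SHAP score for query parameters. It is not one of the paper's polynomial-time algorithms. Everything in it is an assumption chosen for illustration: the toy `MOVIES` relation, the filter query, the independent marginals in `MARGINALS`, and Jaccard similarity as the utility function. It also assumes the usual SHAP game, in which the value of a set S of parameters is the expected similarity when the parameters in S keep their original values and the rest are drawn from the distribution.

```python
from itertools import combinations, product
from math import factorial

# Hypothetical toy relation: (title, year, rating).
MOVIES = [("A", 1999, 8), ("B", 2005, 6), ("C", 2010, 9), ("D", 2010, 5)]

# Assumed independent marginal distribution for each parameter.
MARGINALS = [
    {1999: 0.25, 2005: 0.5, 2010: 0.25},  # parameter 0: minimum year
    {5: 0.5, 6: 0.5},                     # parameter 1: minimum rating
]


def query(params):
    """Parameterized filter query: titles with year >= p0 and rating >= p1."""
    min_year, min_rating = params
    return frozenset(t for t, y, r in MOVIES if y >= min_year and r >= min_rating)


def jaccard(a, b):
    """Similarity between two query results (an assumed utility function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shap_scores(query, reference, marginals, similarity):
    """Exact SHAP score per parameter by enumeration (exponential time)."""
    n = len(reference)
    ref_result = query(reference)

    def value(fixed):
        # Expected similarity when parameters in `fixed` keep their reference
        # values and all others are drawn independently from their marginals.
        free = [i for i in range(n) if i not in fixed]
        total = 0.0
        for combo in product(*(marginals[i].items() for i in free)):
            params, prob = list(reference), 1.0
            for i, (val, p) in zip(free, combo):
                params[i] = val
                prob *= p
            total += prob * similarity(ref_result, query(tuple(params)))
        return total

    scores = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        scores.append(phi)
    return scores


if __name__ == "__main__":
    print(shap_scores(query, (2005, 6), MARGINALS, jaccard))
```

The scores sum to the gap between the similarity at the original parameters (which is 1) and the expected similarity under the distribution, which is the efficiency property of the Shapley value. The enumeration is exponential in the number of parameters, which is why the complexity question studied in the paper matters.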
