We propose and study a framework for quantifying the importance of the choice of parameter values to the result of a query over a database. These parameters occur as constants in logical queries, such as conjunctive queries. In our framework, the importance of a parameter is its SHAP score. This score is a popular instantiation of the game-theoretic Shapley value for measuring the importance of feature values in machine learning models. We make the case for using this score by explaining the intuition behind SHAP, and by showing that we arrive at this score via two different, apparently opposing approaches to quantifying the contribution of a parameter. The application of the SHAP score requires two components in addition to the query and the database: (a) a probability distribution over the combinations of parameter values, and (b) a utility function that measures the similarity between the result for the original parameters and the result for hypothetical parameters. The main question addressed in the paper is the complexity of calculating the SHAP score for different distributions and similarity measures. We first address the case of probabilistically independent parameters. As one would expect, the problem is hard for fragments of queries that are hard to evaluate, but it remains hard even for the fragment of acyclic conjunctive queries. In some cases, though, one can efficiently list all relevant parameter combinations, and then the SHAP score can be computed in polynomial time under reasonably general conditions. Also tractable is the case of full acyclic conjunctive queries for certain (natural) similarity functions. We extend our results to conjunctive queries with inequalities between variables and parameters. Finally, we discuss a simple approximation technique for the case of correlated parameters.
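To make the ingredients concrete, the following is a minimal brute-force sketch of a SHAP score for query parameters under independent parameter distributions. Everything in it is an illustrative assumption rather than the paper's construction: the toy relation `db`, the two-parameter selection query `query`, the choice of Jaccard similarity as the utility, and the marginal distributions in `marginals`. It enumerates all parameter combinations and all coalitions, so it runs in exponential time and is not one of the polynomial-time algorithms referred to above.

```python
from itertools import combinations, product
from math import factorial


def query(db, params):
    """Hypothetical parameterized query: names of rows whose department
    equals params[0] and whose salary is at least params[1]."""
    dept, min_salary = params
    return frozenset(name for name, d, sal in db if d == dept and sal >= min_salary)


def jaccard(a, b):
    """Assumed similarity measure between two query results."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def expected_similarity(db, reference, marginals, fixed):
    """Expected similarity to the reference result when the parameters in
    `fixed` keep their reference values and all others are drawn
    independently from their marginals."""
    ref_result = query(db, reference)
    free = [j for j in range(len(reference)) if j not in fixed]
    total = 0.0
    for values in product(*(marginals[j].items() for j in free)):
        params = list(reference)
        prob = 1.0
        for j, (val, p) in zip(free, values):
            params[j] = val
            prob *= p
        total += prob * jaccard(ref_result, query(db, tuple(params)))
    return total


def shap_scores(db, reference, marginals):
    """Shapley value of each parameter in the game whose value for a
    coalition J is expected_similarity(..., fixed=J)."""
    n = len(reference)
    scores = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        score = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                fixed = set(subset)
                with_i = expected_similarity(db, reference, marginals, fixed | {i})
                without_i = expected_similarity(db, reference, marginals, fixed)
                score += weight * (with_i - without_i)
        scores.append(score)
    return scores


# Toy data (hypothetical): (name, department, salary)
db = [("ann", "eng", 120), ("bob", "eng", 90), ("cat", "ops", 100), ("dan", "ops", 80)]
reference = ("eng", 100)  # the original parameter values
marginals = [{"eng": 0.5, "ops": 0.5}, {80: 0.5, 100: 0.5}]

scores = shap_scores(db, reference, marginals)
print(scores)
```

A useful sanity check on any such implementation is the efficiency property of the Shapley value: the scores sum to the similarity with all parameters fixed (here 1) minus the expected similarity with none fixed.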