Inferring variable importance is a key problem in many scientific studies, where researchers seek to learn the effect of a feature $X$ on the outcome $Y$ in the presence of confounding variables $Z$. Focusing on classification problems, we define the expected total variation (ETV), an intuitive and deterministic measure of variable importance that is not tied to any particular model. We then introduce algorithms for statistical inference on the ETV under design-based/model-X assumptions. These algorithms build on the floodgate notion for regression problems (Zhang and Janson 2020), can leverage any user-specified regression function, and produce asymptotic lower confidence bounds for the ETV. We demonstrate the effectiveness of our algorithms with simulations and a case study in conjoint analysis on the US general election.