Selective inference (SI) has been actively studied as a promising framework for statistical hypothesis testing of data-driven hypotheses. The basic idea of SI is to make inferences conditional on the event that a hypothesis is selected. To perform SI, this selection event must be characterized in a tractable form. When the selection event is too difficult to characterize, additional conditions are introduced for tractability. These additional conditions often cause a loss of power, an issue referred to as over-conditioning in [Fithian et al., 2014]. Parametric programming-based SI (PP-based SI) has been proposed as one way to address the over-conditioning issue. The main problem of PP-based SI is its high computational cost, which arises from the need to exhaustively explore the data space. In this study, we introduce a procedure that reduces the computational cost while guaranteeing the desired precision, by proposing a method to compute lower and upper bounds of p-values. We also propose three types of search strategies that efficiently improve these bounds. We demonstrate the effectiveness of the proposed method in hypothesis testing problems for feature selection in linear models and attention region identification in deep neural networks.
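To illustrate how a p-value can be bounded before the search is complete, the following is a minimal sketch under simplifying assumptions and is not the authors' implementation. It assumes a test statistic that is standard normal under the null hypothesis, a two-sided selective p-value, and a single searched interval inside which the selected region is fully known. The names `p_value_bounds`, `selected`, and `searched` are hypothetical. Writing A for the extreme mass and B for the total mass of the known selected region, and u_e and u_n for the extreme and non-extreme mass of the unsearched region, the true p-value lies between A / (B + u_n) and (A + u_e) / (B + u_e).

```python
import math


def norm_cdf(x):
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mass(a, b):
    """Standard normal mass of the interval [a, b]; zero if the interval is empty."""
    return max(0.0, norm_cdf(b) - norm_cdf(a))


def split_mass(a, b, t):
    """Split the mass of [a, b] into (extreme, non_extreme), where extreme means |z| >= t."""
    non_extreme = mass(max(a, -t), min(b, t))
    extreme = max(0.0, mass(a, b) - non_extreme)
    return extreme, non_extreme


def p_value_bounds(z_obs, selected, searched):
    """Bound a two-sided selective p-value from a partial search (illustrative sketch).

    z_obs    : observed test statistic
    selected : list of (a, b) intervals inside `searched` where the same
               hypothesis is selected; assumed to contain z_obs
    searched : (lo, hi) interval that has been explored so far
    """
    t = abs(z_obs)
    A = B = 0.0  # extreme mass and total mass of the known selected region
    for a, b in selected:
        e, n = split_mass(a, b, t)
        A += e
        B += e + n

    lo, hi = searched
    u_e = u_n = 0.0  # extreme and non-extreme mass of the unsearched region
    for a, b in [(-math.inf, lo), (hi, math.inf)]:
        e, n = split_mass(a, b, t)
        u_e += e
        u_n += n

    # Worst cases: all unsearched non-extreme mass is selected (lower bound),
    # or all unsearched extreme mass is selected (upper bound).
    lower = A / (B + u_n)
    upper = (A + u_e) / (B + u_e)
    return lower, upper


narrow = p_value_bounds(1.5, [(1.0, 2.0)], (0.5, 2.5))
wide = p_value_bounds(1.5, [(1.0, 2.0)], (-3.0, 3.0))
full = p_value_bounds(1.5, [(1.0, 2.0)], (-math.inf, math.inf))
```

In this toy setting, widening the searched interval tightens the bounds, and they coincide once the whole line has been searched. A search could therefore stop as soon as both bounds fall on the same side of the significance level, or as soon as their gap is below a chosen tolerance.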