Selective inference (SI) has been actively studied as a promising framework for statistical hypothesis testing of data-driven hypotheses. The basic idea of SI is to make inferences conditional on the event that a hypothesis is selected. To perform SI, this event must be characterized in a tractable form. When the selection event is too difficult to characterize, additional conditions are introduced for tractability. These additional conditions often cause a loss of power, an issue referred to as over-conditioning. Parametric programming-based SI (PP-based SI) has been proposed as one way to address the over-conditioning issue. The main problem of PP-based SI is its high computational cost, which stems from the need to exhaustively explore the data space. In this study, we introduce a procedure that reduces the computational cost while guaranteeing the desired precision by computing upper and lower bounds of p-values. We also propose three types of search strategies that efficiently improve these bounds. We demonstrate the effectiveness of the proposed method in hypothesis testing problems for feature selection in linear models and attention region identification in deep neural networks.