Influence diagnosis is an integral part of data analysis, yet most existing methodological frameworks presume a deterministic submodel and are designed for low-dimensional data (i.e., the number of predictors p is smaller than the sample size n). However, the stochastic selection of a submodel from high-dimensional data, where p exceeds n, has become ubiquitous. Methods for identifying observations that could exert undue influence on the choice of a submodel can therefore play an important role in this setting. To date, discussion of this topic has been limited and falls short in two respects: (i) a constrained ability to detect multiple influential points, and (ii) applicability only in restrictive settings. After describing the problem, we characterize and formalize the concept of influential observations on variable selection. We then propose a generalized diagnostic measure that extends an existing metric to accommodate different model selectors and multiple influential observations. Its asymptotic distribution is subsequently established for large p, thus providing guidelines for ascertaining influential observations. A high-dimensional clustering procedure is further incorporated into the proposed scheme to detect multiple influential points. Simulations are conducted to assess the performance of various diagnostic approaches. The proposed procedure further demonstrates its value in improving predictive power when analyzing thermally stimulated pain based on fMRI data.