A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.
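To illustrate why reliability is hard to quantify for such pipelines, the following minimal sketch simulates a toy clustering pipeline on pure-noise data and applies a naive two-sample z-test to the resulting clusters. It is not the proposed selective inference test. The pipeline components (MAD-based outlier removal, variance-based feature selection, 2-means clustering), all function names, and the known-unit-variance assumption are hypothetical choices made for illustration only.

```python
import math
import numpy as np


def toy_pipeline(X, z_thresh=3.0, n_features=2, n_iter=20):
    """Hypothetical pipeline: outlier removal -> feature selection -> 2-means."""
    # Step 1: drop rows with any robust z-score (median/MAD) above the threshold.
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    keep = (np.abs(X - med) / mad).max(axis=1) <= z_thresh
    X = X[keep]

    # Step 2: keep the n_features columns with the largest sample variance,
    # ordered from highest to lowest variance.
    feats = np.argsort(X.var(axis=0))[::-1][:n_features]
    Z = X[:, feats]

    # Step 3: 2-means (Lloyd's algorithm), initialised at the two extreme
    # points along the highest-variance selected feature.
    centers = Z[[Z[:, 0].argmin(), Z[:, 0].argmax()]]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        centers = np.array([Z[labels == k].mean(axis=0) for k in (0, 1)])
    return X, feats, labels


def naive_p_value(x, labels, sigma=1.0):
    """Two-sided z-test for a difference in cluster means (assumes known sigma).

    This treats the clusters as if they were fixed in advance, i.e. it ignores
    that outliers, features, and labels were all chosen using the same data.
    """
    a, b = x[labels == 0], x[labels == 1]
    se = sigma * math.sqrt(1.0 / len(a) + 1.0 / len(b))
    z = (a.mean() - b.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def naive_rejection_rate(n_trials=200, n=60, d=5, alpha=0.05, seed=0):
    """Fraction of null datasets (no true clusters) where the naive test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        X = rng.standard_normal((n, d))  # pure noise: the null hypothesis holds
        X_kept, feats, labels = toy_pipeline(X)
        p = naive_p_value(X_kept[:, feats[0]], labels)
        rejections += p < alpha
    return rejections / n_trials


if __name__ == "__main__":
    rate = naive_rejection_rate()
    print(f"naive rejection rate at alpha=0.05: {rate:.2f}")
```

Because the clusters are chosen to be well separated on the very data being tested, the naive rejection rate in this sketch should land far above the nominal 0.05 even though no cluster structure exists. A test that conditions on the pipeline's data-dependent selections is meant to close this gap.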