A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.
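To make the setting concrete, the following is a minimal sketch of a clustering pipeline of the kind described above: outlier detection, then feature selection, then clustering. The specific components are illustrative assumptions rather than the ones used in this study: a median/MAD outlier rule, variance-based feature ranking, and a two-cluster Lloyd iteration. All function names (`remove_outliers`, `select_features`, `two_means`, `run_pipeline`, `make_toy_data`) are hypothetical. The sketch does not implement the proposed selective inference test; it only shows that every stage depends on the data.

```python
import numpy as np


def remove_outliers(X, threshold=3.0):
    """Return a boolean mask of rows kept by a median/MAD rule (assumed component)."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)          # guard against zero spread
    z = np.abs(X - med) / (1.4826 * mad)        # robust z-score per entry
    return np.all(z <= threshold, axis=1)


def select_features(X, k):
    """Return the sorted indices of the k highest-variance features (assumed component)."""
    order = np.argsort(X.var(axis=0))[::-1]
    return np.sort(order[:k])


def two_means(X, n_iter=100):
    """Lloyd's algorithm with two clusters and a deterministic initialisation."""
    s = X.sum(axis=1)
    centers = X[[np.argmin(s), np.argmax(s)]].astype(float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(2):
            if np.any(labels == j):             # keep old center if a cluster empties
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def run_pipeline(X, n_features):
    """Chain the three data-dependent steps and report what each one selected."""
    keep = remove_outliers(X)
    X_clean = X[keep]
    features = select_features(X_clean, n_features)
    labels = two_means(X_clean[:, features])
    return {"kept": np.flatnonzero(keep), "features": features, "labels": labels}


def make_toy_data():
    """Deterministic toy data: two separated groups, two noise features, one outlier."""
    t = np.arange(20)
    a = np.column_stack([-5 + 0.5 * np.sin(t), -5 + 0.5 * np.cos(t),
                         0.1 * np.sin(2 * t), 0.1 * np.cos(3 * t)])
    b = np.column_stack([5 + 0.5 * np.sin(t), 5 + 0.5 * np.cos(t),
                         0.1 * np.cos(2 * t), 0.1 * np.sin(3 * t)])
    outlier = np.array([[0.0, 0.0, 50.0, 0.0]])
    return np.vstack([a, b, outlier])


if __name__ == "__main__":
    result = run_pipeline(make_toy_data(), n_features=2)
    print(result["kept"].size, result["features"], np.bincount(result["labels"]))
```

Because the retained rows, the selected features, and the cluster labels are all chosen using the same data, a hypothesis about the resulting clusters is itself data-dependent. This is the situation in which a classical test applied to the final clusters is generally not valid, and which motivates a selective inference treatment of the whole pipeline.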