Statistical Testing Framework for Clustering Pipelines by Selective Inference - 专知论文

会员服务 ·

0

统计检验 · 分析 · 推断 · 结构 · 数据集 ·

Statistical Testing Framework for Clustering Pipelines by Selective Inference

翻译：基于选择性推断的聚类流程统计检验框架

Yugo Miyata,Tomohiro Shiraishi,Shunichi Nishino,Ichiro Takeuchi

from arxiv, 59 pages, 11 figures

A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.

翻译：数据分析流程是将原始数据转化为有意义洞察的结构化步骤序列，通过整合多种分析算法实现。在实际应用中，分析结论往往需要数据经过此类流程中多个数据依赖步骤后方可获取。本研究针对数据分析流程输出结果的统计可靠性量化问题展开探讨。作为概念验证，我们聚焦于通过异常检测、特征选择和聚类等流程从复杂异构数据中识别聚类结构的聚类流程。我们提出了一种新颖的统计检验框架，用于评估此类流程所得聚类结果的显著性。该框架基于选择性推断方法，可系统构建由预定义组件组成的聚类流程的有效统计检验。理论证明所提检验能在任意名义水平控制第一类错误率，并通过合成数据集与真实数据集的实验验证了其有效性与实用性。

0

相关内容

统计检验

基于因果推断的推荐系统去偏研究

基于因果推断的推荐系统去偏研究

专知会员服务

21+阅读 · 2024年11月10日

《基于高斯混合流和入包的异常检测》2023最新57页论文

《基于高斯混合流和入包的异常检测》2023最新57页论文

专知会员服务

29+阅读 · 2023年5月15日

具有组合结构的统计推断和在线算法

具有组合结构的统计推断和在线算法

专知会员服务

12+阅读 · 2022年12月13日

【斯坦福大学博士论文】复杂统计模型中的因果和选择性推理，274页pdf

【斯坦福大学博士论文】复杂统计模型中的因果和选择性推理，274页pdf

专知会员服务

86+阅读 · 2022年9月15日

【Alex Nowak-Vila博士论文】有理论保证的结构化预测， Structured Prediction with Theoretical Guarantees

【Alex Nowak-Vila博士论文】有理论保证的结构化预测， Structured Prediction with Theoretical Guarantees

专知会员服务

13+阅读 · 2022年3月15日

【干货书】数据科学统计推断，124页pdf

专知会员服务

79+阅读 · 2021年10月12日

【AAAI2021】对比聚类，Contrastive Clustering

【AAAI2021】对比聚类，Contrastive Clustering

专知会员服务

78+阅读 · 2021年1月30日

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

专知会员服务

27+阅读 · 2020年7月24日

【IJCAI2020】统计相关模型，A Complete Characterization of Projectivity for Statistical Relational Models

【IJCAI2020】统计相关模型，A Complete Characterization of Projectivity for Statistical Relational Models

专知会员服务

20+阅读 · 2020年4月25日

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

专知会员服务

14+阅读 · 2019年10月25日

【干货书】基于统计和机器学习的实用时间序列分析预测，Time Series Analysis Prediction

【干货书】基于统计和机器学习的实用时间序列分析预测，Time Series Analysis Prediction

专知

18+阅读 · 2022年4月9日

【干货书】统计基础、推理与推断，361页pdf

【干货书】统计基础、推理与推断，361页pdf

专知

11+阅读 · 2022年1月25日

【AAAI2021】对比聚类，Contrastive Clustering

【AAAI2021】对比聚类，Contrastive Clustering

专知

26+阅读 · 2021年1月30日

最新「因果推断Causal Inference」综述论文38页pdf，阿里巴巴、Buffalo、Georgia、Virginia

最新「因果推断Causal Inference」综述论文38页pdf，阿里巴巴、Buffalo、Georgia、Virginia

专知

68+阅读 · 2020年2月11日

详解 | 推荐系统的工程实现

详解 | 推荐系统的工程实现

AI100

42+阅读 · 2019年3月15日

【大数据】海量数据分析能力形成和大数据关键技术

【大数据】海量数据分析能力形成和大数据关键技术

产业智能官

17+阅读 · 2018年10月29日

推荐系统算法合集，满满都是干货（建议收藏）

推荐系统算法合集，满满都是干货（建议收藏）

七月在线实验室

17+阅读 · 2018年7月23日

干货：基于用户画像的聚类分析

干货：基于用户画像的聚类分析

数据分析

22+阅读 · 2018年5月17日

【入门】数据分析六部曲

【入门】数据分析六部曲

36大数据

18+阅读 · 2017年12月6日

【大数据】数据挖掘与数据分析知识流程梳理

【大数据】数据挖掘与数据分析知识流程梳理

产业智能官

13+阅读 · 2017年9月22日

抽样环境下基于流记录的行为特征分析与多分类器识别模型研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于数据特征选择与匹配的工业过程监测方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

非确定型Web服务流程重组的可靠性验证技术

国家自然科学基金

1+阅读 · 2015年12月31日

基于复杂数据的回归模型统计推断及其应用

国家自然科学基金

3+阅读 · 2015年12月31日

多重排序数据的整合分析

国家自然科学基金

0+阅读 · 2015年12月31日

试验设计中的模型选择

国家自然科学基金

6+阅读 · 2014年12月31日

测量误差数据下约束线性模型的有偏估计及变量选择研究

国家自然科学基金

0+阅读 · 2014年12月31日

相依回归模型与扩散过程的统计推断及其应用

国家自然科学基金

1+阅读 · 2014年12月31日

流程监控与评估中多元数据整合研究

国家自然科学基金

1+阅读 · 2014年12月31日

因果推断的统计方法

国家自然科学基金

26+阅读 · 2011年12月31日

Probabilistic data quality assessment for structural monitoring data via outlier-resistant conditional diffusion model

Arxiv

0+阅读 · 4月29日

Detecting dependence structure: visualization and inference

Arxiv

0+阅读 · 4月24日

Statistical inference with win statistics in cluster-randomized trials with composite outcomes

Arxiv

0+阅读 · 4月20日

Propensity Score Propagation: A General Framework for Design-Based Inference with Unknown Propensity Scores

Arxiv

0+阅读 · 4月16日

Forecasting Multivariate Time Series under Predictive Heterogeneity: A Validation-Driven Clustering Framework

Arxiv

0+阅读 · 4月15日

Quantitative Verification with Neural Networks

Arxiv

0+阅读 · 4月14日

Statistical inference for subgraph counts and clustering coefficient using network sampling in a sparse Stochastic Block Model framework

Arxiv

0+阅读 · 4月13日

Inference in Regression Discontinuity Designs with Clustered Data

Arxiv

0+阅读 · 3月19日

Statistical Testing Framework for Clustering Pipelines by Selective Inference

Arxiv

0+阅读 · 3月19日

Statistical Inference for Online Algorithms

Arxiv

0+阅读 · 3月18日

VIP会员

文章信息

相关主题

最新内容

美国从乌克兰无人机战争中学习经验

美国从乌克兰无人机战争中学习经验

专知会员服务

7+阅读 · 6月21日

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

专知会员服务

5+阅读 · 6月21日

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

专知会员服务

7+阅读 · 6月21日

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

专知会员服务

20+阅读 · 6月20日

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

专知会员服务

5+阅读 · 6月19日

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

专知会员服务

8+阅读 · 6月19日

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

7+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

9+阅读 · 6月18日

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

13+阅读 · 6月18日

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

12+阅读 · 6月18日

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

8+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

13+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

10+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

24+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

12+阅读 · 6月17日

相关VIP内容

基于因果推断的推荐系统去偏研究

基于因果推断的推荐系统去偏研究

专知会员服务

21+阅读 · 2024年11月10日

《基于高斯混合流和入包的异常检测》2023最新57页论文

《基于高斯混合流和入包的异常检测》2023最新57页论文

专知会员服务

29+阅读 · 2023年5月15日

具有组合结构的统计推断和在线算法

具有组合结构的统计推断和在线算法

专知会员服务

12+阅读 · 2022年12月13日

【斯坦福大学博士论文】复杂统计模型中的因果和选择性推理，274页pdf

【斯坦福大学博士论文】复杂统计模型中的因果和选择性推理，274页pdf

专知会员服务

86+阅读 · 2022年9月15日

【Alex Nowak-Vila博士论文】有理论保证的结构化预测， Structured Prediction with Theoretical Guarantees

【Alex Nowak-Vila博士论文】有理论保证的结构化预测， Structured Prediction with Theoretical Guarantees

专知会员服务

13+阅读 · 2022年3月15日

【干货书】数据科学统计推断，124页pdf

专知会员服务

79+阅读 · 2021年10月12日

【AAAI2021】对比聚类，Contrastive Clustering

【AAAI2021】对比聚类，Contrastive Clustering

专知会员服务

78+阅读 · 2021年1月30日

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

专知会员服务

27+阅读 · 2020年7月24日

【IJCAI2020】统计相关模型，A Complete Characterization of Projectivity for Statistical Relational Models

【IJCAI2020】统计相关模型，A Complete Characterization of Projectivity for Statistical Relational Models

专知会员服务

20+阅读 · 2020年4月25日

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

专知会员服务

14+阅读 · 2019年10月25日

热门VIP内容

开通专知VIP会员享更多权益服务

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

美国从乌克兰无人机战争中学习经验

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

相关资讯

【干货书】基于统计和机器学习的实用时间序列分析预测，Time Series Analysis Prediction

【干货书】基于统计和机器学习的实用时间序列分析预测，Time Series Analysis Prediction

专知

18+阅读 · 2022年4月9日

【干货书】统计基础、推理与推断，361页pdf

【干货书】统计基础、推理与推断，361页pdf

专知

11+阅读 · 2022年1月25日

【AAAI2021】对比聚类，Contrastive Clustering

【AAAI2021】对比聚类，Contrastive Clustering

专知

26+阅读 · 2021年1月30日

最新「因果推断Causal Inference」综述论文38页pdf，阿里巴巴、Buffalo、Georgia、Virginia

最新「因果推断Causal Inference」综述论文38页pdf，阿里巴巴、Buffalo、Georgia、Virginia

专知

68+阅读 · 2020年2月11日

详解 | 推荐系统的工程实现

详解 | 推荐系统的工程实现

AI100

42+阅读 · 2019年3月15日

【大数据】海量数据分析能力形成和大数据关键技术

【大数据】海量数据分析能力形成和大数据关键技术

产业智能官

17+阅读 · 2018年10月29日

推荐系统算法合集，满满都是干货（建议收藏）

推荐系统算法合集，满满都是干货（建议收藏）

七月在线实验室

17+阅读 · 2018年7月23日

干货：基于用户画像的聚类分析

干货：基于用户画像的聚类分析

数据分析

22+阅读 · 2018年5月17日

【入门】数据分析六部曲

【入门】数据分析六部曲

36大数据

18+阅读 · 2017年12月6日

【大数据】数据挖掘与数据分析知识流程梳理

【大数据】数据挖掘与数据分析知识流程梳理

产业智能官

13+阅读 · 2017年9月22日

相关论文

Probabilistic data quality assessment for structural monitoring data via outlier-resistant conditional diffusion model

Arxiv

0+阅读 · 4月29日

Detecting dependence structure: visualization and inference

Arxiv

0+阅读 · 4月24日

Statistical inference with win statistics in cluster-randomized trials with composite outcomes

Arxiv

0+阅读 · 4月20日

Propensity Score Propagation: A General Framework for Design-Based Inference with Unknown Propensity Scores

Arxiv

0+阅读 · 4月16日

Forecasting Multivariate Time Series under Predictive Heterogeneity: A Validation-Driven Clustering Framework

Arxiv

0+阅读 · 4月15日

Quantitative Verification with Neural Networks

Arxiv

0+阅读 · 4月14日

Statistical inference for subgraph counts and clustering coefficient using network sampling in a sparse Stochastic Block Model framework

Arxiv

0+阅读 · 4月13日

Inference in Regression Discontinuity Designs with Clustered Data

Arxiv

0+阅读 · 3月19日

Statistical Testing Framework for Clustering Pipelines by Selective Inference

Arxiv

0+阅读 · 3月19日

Statistical Inference for Online Algorithms

Arxiv

0+阅读 · 3月18日

相关基金

抽样环境下基于流记录的行为特征分析与多分类器识别模型研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于数据特征选择与匹配的工业过程监测方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

非确定型Web服务流程重组的可靠性验证技术

国家自然科学基金

1+阅读 · 2015年12月31日

基于复杂数据的回归模型统计推断及其应用

国家自然科学基金

3+阅读 · 2015年12月31日

多重排序数据的整合分析

国家自然科学基金

0+阅读 · 2015年12月31日

试验设计中的模型选择

国家自然科学基金

6+阅读 · 2014年12月31日

测量误差数据下约束线性模型的有偏估计及变量选择研究

国家自然科学基金

0+阅读 · 2014年12月31日

相依回归模型与扩散过程的统计推断及其应用

国家自然科学基金

1+阅读 · 2014年12月31日

流程监控与评估中多元数据整合研究

国家自然科学基金

1+阅读 · 2014年12月31日

因果推断的统计方法

国家自然科学基金

26+阅读 · 2011年12月31日

微信扫码咨询专知VIP会员