A key step in knowledge discovery is the evaluation of data mining results. In several applications, such as pattern mining and graph analysis, this step includes assessing the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound on the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze only an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach reduces the running time by up to two orders of magnitude compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.
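For context, the kind of conventional resampling-based test that the abstract contrasts with can be sketched as below. This is not FewRS and does not use its bound; it is a generic Monte Carlo permutation test, and the function name `permutation_p_value`, the test statistic (absolute difference of means), and the parameter `num_resamples` are assumptions chosen only for illustration. The sketch shows why such tests tend to need many resampled datasets: the smallest empirical p-value they can report is 1 / (num_resamples + 1).

```python
import random


def permutation_p_value(x, y, num_resamples=1000, seed=0):
    """Empirical p-value for the absolute difference in means of x and y.

    Illustrative baseline only: a standard Monte Carlo permutation test,
    not the FewRS procedure described in the abstract.
    """
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    at_least_as_extreme = 0
    for _ in range(num_resamples):
        # Each shuffle yields one resampled dataset under the null
        # hypothesis that group labels are exchangeable.
        rng.shuffle(pooled)
        rx, ry = pooled[:len(x)], pooled[len(x):]
        stat = abs(sum(rx) / len(rx) - sum(ry) / len(ry))
        if stat >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed dataset itself, so the smallest
    # attainable p-value is 1 / (num_resamples + 1).
    return (1 + at_least_as_extreme) / (1 + num_resamples)
```

With `num_resamples=9`, for example, the returned value can never fall below 0.1, however strong the effect. Reaching the small significance thresholds that arise when many results are tested at once therefore requires a correspondingly large number of resamples, each of which must be generated and analyzed in full.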