The classical Kolmogorov-Smirnov (KS) two-sample test is widely used to test whether two empirical samples come from the same distribution. Although most statistical packages provide an implementation, carrying out the test in big data settings can be challenging because it requires a full sort of the data. The popular Apache Spark system for big data processing provides a one-sample KS test, but not the two-sample version. Recent Spark versions do, however, provide the approxQuantile method for querying $\epsilon$-approximate quantiles. We build on approxQuantile to propose a variation of the classical KS two-sample test that constructs approximate cumulative distribution functions (CDFs) from $\epsilon$-approximate quantiles. We derive error bounds for the approximate CDFs and show how to use this information to carry out KS tests. Pseudocode for the approach requires 15 executable lines. A Python implementation appears in the appendix.
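To make the idea concrete, the following is a minimal sketch of a quantile-based two-sample KS statistic. It is not the paper's pseudocode or its appendix implementation. The helper `quantiles_stand_in` is a hypothetical substitute that sorts the data exactly so the example runs without Spark; in a Spark job the quantile lists would presumably come from `DataFrame.approxQuantile(col, probabilities, relativeError)`. The step-function CDF and the probability grid are assumptions made for illustration, and the sketch does not include the error bounds derived in the paper.

```python
import bisect
import math
import random


def quantiles_stand_in(data, probs):
    """Exact quantiles by sorting; a hypothetical stand-in for approxQuantile."""
    s = sorted(data)
    n = len(s)
    # Use rank ceil(p * n), clamped to [1, n]; the small offset guards
    # against floating-point overshoot in p * n.
    return [s[min(n - 1, max(0, math.ceil(p * n - 1e-9) - 1))] for p in probs]


def step_cdf(quantiles, probs, x):
    """Approximate CDF at x: the largest p_j whose quantile q_j is <= x."""
    j = bisect.bisect_right(quantiles, x)
    return probs[j - 1] if j > 0 else 0.0


def approx_ks_statistic(q1, q2, probs):
    """Largest gap between the two step CDFs, checked at every quantile point."""
    points = sorted(set(q1) | set(q2))
    return max(abs(step_cdf(q1, probs, x) - step_cdf(q2, probs, x))
               for x in points)


def ks_critical_value(n1, n2, alpha=0.05):
    """Classical large-sample critical value for the two-sample KS test."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n1 + n2) / (n1 * n2))


# Usage: two synthetic samples with shifted means (illustrative data only).
rng = random.Random(0)
sample_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
sample_b = [rng.gauss(0.5, 1.0) for _ in range(5000)]
probs = [i / 100 for i in range(1, 101)]  # assumed grid of 100 probabilities

d_hat = approx_ks_statistic(
    quantiles_stand_in(sample_a, probs),
    quantiles_stand_in(sample_b, probs),
    probs,
)
threshold = ks_critical_value(len(sample_a), len(sample_b))
print(d_hat, threshold)
```

Because the statistic is computed from approximate CDFs, comparing `d_hat` directly with the classical threshold ignores the approximation error. A real decision rule would need to account for the error in each CDF, which depends on both $\epsilon$ and the spacing of the probability grid.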