Big data, of dimension N×P where N is extremely large, has created new challenges for data analysis, particularly in forming meaningful clusters of data. Clustering techniques such as K-means and hierarchical clustering are popular methods for exploratory analysis of large datasets. Unfortunately, these methods cannot always be applied to big data because of memory or time constraints arising from calculations of order P×N(N-1). To circumvent this problem, the clustering technique is typically applied to a random sample drawn from the dataset; a weakness of this approach is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of "data nuggets", which reduce a large dataset to a small collection of nuggets of data, each containing a center, a weight, and a scale parameter. The data nuggets are then passed to algorithms that carry out methods such as principal components analysis (PCA) and clustering in a more computationally efficient manner. We show the consistency of the data-nugget-based covariance estimator and apply the data nugget methodology to an exploratory analysis of a flow cytometry dataset containing over one million observations, using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
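To make the idea of working with centers, weights, and scales concrete, the following is a minimal Python sketch rather than the authors' implementation. The abstract does not say how nuggets are constructed, so `make_nuggets` uses a simple stand-in (nearest randomly chosen seed point), and all function names here are hypothetical. The covariance step uses only the weighted nugget centers, which may differ from the estimator whose consistency is shown in the article.

```python
import numpy as np

def make_nuggets(X, n_nuggets, rng):
    """Stand-in nugget construction (an assumption, not the authors' algorithm):
    assign each observation to its nearest randomly chosen seed point."""
    seeds = X[rng.choice(len(X), size=n_nuggets, replace=False)]
    # Squared distances via the expansion |x|^2 - 2 x.s + |s|^2, shape (N, n_nuggets)
    d2 = (X**2).sum(1)[:, None] - 2.0 * X @ seeds.T + (seeds**2).sum(1)[None, :]
    labels = d2.argmin(axis=1)
    weights = np.bincount(labels, minlength=n_nuggets).astype(float)
    sums = np.zeros_like(seeds, dtype=float)
    np.add.at(sums, labels, X)                       # per-nugget coordinate sums
    centers = sums / np.maximum(weights, 1.0)[:, None]
    # Scale: mean squared distance of a nugget's members from its center
    sq = ((X - centers[labels]) ** 2).sum(axis=1)
    scales = np.bincount(labels, weights=sq, minlength=n_nuggets) / np.maximum(weights, 1.0)
    keep = weights > 0                               # drop any empty nuggets
    return centers[keep], weights[keep], scales[keep]

def weighted_cov(centers, weights):
    """Weighted mean and covariance of nugget centers (between-nugget part only)."""
    mean = np.average(centers, axis=0, weights=weights)
    d = centers - mean
    cov = (weights[:, None] * d).T @ d / weights.sum()
    return mean, cov

def weighted_pca(cov):
    """Eigen-decomposition of a covariance matrix, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def weighted_kmeans(centers, weights, k, rng, n_iter=50):
    """Lloyd's iterations on nugget centers, with weighted cluster means."""
    means = centers[rng.choice(len(centers), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((centers[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                means[j] = np.average(centers[mask], axis=0, weights=weights[mask])
    return labels, means

# Synthetic example: two Gaussian blobs in three dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (5000, 3)), rng.normal(5.0, 1.0, (5000, 3))])
centers, weights, scales = make_nuggets(X, 200, rng)
mean, cov = weighted_cov(centers, weights)
eigvals, eigvecs = weighted_pca(cov)
labels, means = weighted_kmeans(centers, weights, k=2, rng=rng)
print(len(centers), "nuggets summarize", int(weights.sum()), "observations")
```

Because the downstream steps operate on a few hundred weighted centers instead of all N observations, their cost depends on the number of nuggets rather than on N. In this sketch the weighted covariance of the centers captures only between-nugget variation; the scale values record the within-nugget spread that it leaves out.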