We propose a computationally simple framework for clustering functional data based on random projections generated from Gaussian processes. Each curve is first projected onto a large collection of independent Gaussian process realizations. The resulting high-dimensional representations are then clustered using the Mean Absolute Difference of Distances (MADD), a dissimilarity measure well suited to high-dimensional settings. A population-level analysis of this dissimilarity provides insight into how random projections help capture distributional differences between functional populations. To additionally leverage data-driven projection directions, we introduce a second stage of clustering. In Stage I, an initial clustering is obtained using a set of prespecified projection families. In Stage II, this partition is refined by constructing Gaussian random projections from a covariance operator estimated using the Stage I cluster labels. Finally, a normalized cost function is used to select the best clustering among the candidate solutions. The proposed algorithm applies to a broad range of functional data regimes, including irregularly and partially observed data. Through extensive simulations and real-data applications, we show that the proposed method achieves high accuracy and outperforms many state-of-the-art methods across a wide range of functional data settings.
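The first two steps (projecting curves onto Gaussian process realizations, then computing MADD on the projected features) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes curves observed on a common regular grid, a squared-exponential covariance kernel for the Gaussian process, a Riemann-sum approximation of the inner product, and the commonly used form of MADD in which the dissimilarity between two observations is the average, over all remaining observations, of the absolute difference of their scaled Euclidean distances. The function names `gp_projections` and `madd_matrix`, the kernel, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def gp_projections(curves, grid, n_proj=200, length_scale=0.2, seed=0):
    """Project each curve onto n_proj independent Gaussian process draws.

    Assumption: squared-exponential kernel; the abstract does not specify one.
    """
    rng = np.random.default_rng(seed)
    diff = grid[:, None] - grid[None, :]
    cov = np.exp(-0.5 * (diff / length_scale) ** 2)
    # Sample via an eigendecomposition, clipping tiny negative eigenvalues
    # that arise from rounding in this ill-conditioned kernel matrix.
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, 0.0, None)
    white = rng.standard_normal((n_proj, len(grid)))
    directions = (white * np.sqrt(vals)) @ vecs.T  # rows are GP realizations
    # Riemann-sum approximation of the inner product <curve, direction>.
    return curves @ directions.T / len(grid)


def madd_matrix(features):
    """Pairwise MADD: mean over k != i, j of |d(i, k) - d(j, k)|."""
    n, d = features.shape
    sq = np.sum(features ** 2, axis=1)
    gram = features @ features.T
    # Euclidean distance scaled by the square root of the dimension.
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * gram, 0.0) / d)
    madd = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False  # leave out the two points being compared
            value = np.mean(np.abs(dist[i, mask] - dist[j, mask]))
            madd[i, j] = madd[j, i] = value
    return madd


# Toy data (illustrative only): two groups of noisy curves on [0, 1].
data_rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 50)
group_a = np.sin(2 * np.pi * grid) + 0.1 * data_rng.standard_normal((10, 50))
group_b = np.cos(2 * np.pi * grid) + 0.1 * data_rng.standard_normal((10, 50))
curves = np.vstack([group_a, group_b])
labels = np.repeat([0, 1], 10)

features = gp_projections(curves, grid)
madd = madd_matrix(features)

# Compare average dissimilarity within groups and between groups.
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
within = madd[same & off_diag].mean()
between = madd[~same].mean()
print(f"mean within-group MADD:  {within:.4f}")
print(f"mean between-group MADD: {between:.4f}")
```

The resulting matrix can be passed to any clustering routine that accepts precomputed dissimilarities. The Stage II refinement and the normalized cost function described above are not covered by this sketch.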