Despite its extensive development for multivariate data, semi-supervised learning remains underdeveloped for functional data. To address this challenge, we extend the Fermat distance, a density-sensitive metric aligning with the semi-supervised setting, to the functional domain. Leveraging the Fermat distance, we propose novel semi-supervised classifiers, including a weighted $k$-nearest neighbors ($k$-NN) classifier and multidimensional scaling (MDS)-induced classifiers. To accommodate the massive datasets common in semi-supervised applications, we design a computationally efficient estimation procedure tailored for discrete and noisy functional observations. Theoretically, we establish exponentially decaying convergence rates for the $k$-NN classifier and the consistency of the estimated Fermat distance. Crucially, our results reveal a phenomenon unique to error-contaminated functional data: incorporating unlabeled data leads to improved classification accuracy only when the individual sampling rate grows sufficiently fast. Applying our framework to simulated data and a large-scale dataset of Gaia astronomical spectra, we demonstrate that our proposed semi-supervised classifiers uniformly outperform existing supervised benchmarks.
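To make the notion of a density-sensitive metric concrete, the following is a minimal sketch of a sample Fermat distance between curves. It is not the estimation procedure proposed here. It assumes noise-free curves observed on a shared grid, uses the $L^2$ distance between curves as the base metric, and follows the usual multivariate definition, in which the distance between two sample points is the smallest sum of $\alpha$-powered pairwise distances along any path through the sample. The names `l2_distance`, `fermat_distances`, and the default `alpha=2.0` are illustrative assumptions.

```python
import math


def l2_distance(f, g, grid):
    """Trapezoidal approximation of the L2 distance between two curves
    sampled on the same grid (assumption: shared, noise-free grid)."""
    sq = [(a - b) ** 2 for a, b in zip(f, g)]
    total = 0.0
    for i in range(1, len(grid)):
        total += 0.5 * (sq[i - 1] + sq[i]) * (grid[i] - grid[i - 1])
    return math.sqrt(total)


def fermat_distances(curves, grid, alpha=2.0):
    """All-pairs sample Fermat distances (illustrative sketch).

    Edge weight between curves i and j is l2_distance(i, j) ** alpha;
    the Fermat distance is the shortest-path length in the complete
    graph on the sample, computed here with Floyd-Warshall (O(n^3)).
    """
    n = len(curves)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = l2_distance(curves[i], curves[j], grid) ** alpha
            dist[i][j] = dist[j][i] = w
    # Relax every pair through every possible intermediate curve k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][k] + dist[k][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist


# Three constant curves at heights 0, 1, 2 on [0, 1].
grid = [0.0, 0.5, 1.0]
curves = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
d = fermat_distances(curves, grid, alpha=2.0)
```

With $\alpha > 1$, a path made of several short hops costs less than one long jump, so routes through densely sampled regions are favored. In this toy example the direct edge between the first and third curves has weight $2^2 = 4$, while the path through the middle curve costs $1 + 1 = 2$. The cubic-time shortest-path step is chosen for brevity only and would not scale to massive datasets.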