This paper investigates clustering in survival data by shifting the analytical focus from cumulative survival probabilities to instantaneous risk, as characterized by the hazard function. We treat B-spline-smoothed log-hazard trajectories as functional objects that capture the temporal evolution of risk and propose a clustering framework based on Functional Principal Component Analysis (FPCA) of these trajectories. The number of retained functional principal components is selected before clustering using a 95% cumulative explained-variance rule, and clustering is then performed on the unstandardized FPCA scores. The proposed methodology is evaluated through simulation studies covering progressively more complex scenarios, including overlapping and crossing hazard functions, cohort imbalance, heterogeneous risk profiles, and outlier contamination. The framework is further illustrated on two real-world clinical datasets: the German Breast Cancer Study and the Primary Biliary Cirrhosis dataset. Results show that the proposed log-hazard-based functional clustering framework provides an interpretable representation of relative temporal risk dynamics, with competitive internal cohesion and explicit robustness diagnostics when compared with cumulative-survival-based benchmarks.
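As an illustration of the pipeline summarized above, the following is a minimal sketch, not the authors' implementation. It assumes the log-hazard trajectories have already been smoothed and evaluated on a common uniform time grid, approximates FPCA by an SVD of the centered curve matrix, applies the 95% cumulative explained-variance rule, and clusters the unstandardized scores. The abstract does not name a clustering algorithm, so the plain k-means used here is an assumption, as are the function names `fpca_scores` and `kmeans` and the synthetic crossing-hazard data.

```python
import numpy as np


def fpca_scores(curves, var_threshold=0.95):
    """Discretized FPCA for curves of shape (n_curves, n_grid).

    Assumes a common, uniform time grid, so ordinary PCA via SVD is used
    as an approximation to FPCA (an illustrative simplification).
    """
    curves = np.asarray(curves, dtype=float)
    centered = curves - curves.mean(axis=0)
    # Rows of vt are the discretized principal component functions.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # Smallest number of components whose cumulative share reaches the threshold.
    n_keep = min(int(np.searchsorted(cumulative, var_threshold)) + 1, len(s))
    # Unstandardized scores: projections are NOT rescaled to unit variance.
    scores = u[:, :n_keep] * s[:n_keep]
    return scores, vt[:n_keep], cumulative[:n_keep]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means (hypothetical stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic example: two groups whose log-hazards cross over time.
rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 50)
rising = -1.0 + 1.5 * t + 0.05 * rng.standard_normal((20, t.size))
falling = 0.5 - 1.5 * t + 0.05 * rng.standard_normal((20, t.size))
log_hazards = np.vstack([rising, falling])

scores, components, cumulative = fpca_scores(log_hazards)
labels = kmeans(scores, k=2)
```

Because the scores are left unstandardized, components that explain more variance carry more weight in the Euclidean distances used for clustering, which is consistent with the design choice stated in the abstract.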